Abstract

We consider the capacity of discrete-time channels with feedback for the general case where the feedback is a
time-invariant deterministic function of the output samples. Under the assumption that the channel states take
values in a finite alphabet, we find an achievable rate and an upper bound on the capacity. We further show that
when the channel is indecomposable and has no intersymbol interference (ISI), its capacity is given by the limit
of the maximum of the (normalized) directed information between the input $X^N$ and the output $Y^N$, i.e.
$C = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to Y^N)$, where the maximization is taken over the causal
conditioning probability $Q(x^N \| z^{N-1})$ defined in this paper. The capacity result is used to show that the
source-channel separation theorem holds for time-invariant deterministic feedback. We also show that if the state
of the channel is known both at the encoder and the decoder, then feedback does not increase capacity.

Index Terms 

Feedback capacity, directed information, causal conditioning, code-tree, random coding, maximum likelihood, 
source-channel coding separation. 



I. Introduction 

Shannon showed in [1] that feedback does not increase the capacity of a memoryless channel, and therefore the
capacity of a memoryless channel with feedback is given by maximizing the mutual information between the input
$X$ and the output $Y$, i.e. $C = \max_{p(x)} I(X;Y)$. In the case that there is no feedback, and the channel is an
indecomposable Finite-State Channel (FSC), the capacity was shown by Gallager [2] and by Blackwell, Breiman
and Thomasian [3] to be

$$C_{NF} = \lim_{N \to \infty} \frac{1}{N} \max_{P(x^N)} I(X^N; Y^N). \tag{1}$$

One might be tempted to think that for an FSC with feedback all that changes for (1) to characterize capacity is the
optimal input distribution, which now must depend on the feedback. However, the following simple counterexample
shows that there are cases in which the mutual information $I(X^N; Y^N)$ results in a larger quantity than the capacity.
Let us consider the case where the channel has only one state, which is a binary symmetric channel (BSC) with
probability of error 0.5. It is obvious that no information can be transferred through this channel even with feedback
and, therefore, the capacity of the channel is zero. However, if we let the input to the channel at time $i$ equal the
output of the channel at time $i-1$, i.e. $X_i = Y_{i-1}$ for $i \geq 2$, which is possible in the presence of feedback,
then it can be easily shown that

$$\frac{1}{N} I(X^N; Y^N) = \frac{1}{N} \sum_{i=1}^{N} I(X_i; Y^N | X^{i-1}). \tag{2}$$

Each term with $i \geq 2$ equals one bit, because $X_i = Y_{i-1}$ is a uniformly distributed bit that is independent of
$X^{i-1}$ and is determined by $Y^N$. Therefore, we see that for this channel
$\lim_{N \to \infty} \frac{1}{N} \max I(X^N; Y^N) = 1$, while the capacity of the channel is zero. The reason such
examples exist is that $I(X^N; Y^N)$ measures the mutual information between $X^N$ and $Y^N$, including the mutual
information that is due to the feedback and not due to the channel. This example thus indicates that the capacity of
the channel with feedback must involve maximization over an expression other than $I(X^N; Y^N)$.
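The counterexample can be checked numerically by brute-force enumeration. The following is a minimal sketch, not part of the original analysis: it assumes $X_1 = 0$ (the text leaves $X_1$ unspecified), a block length of $N = 4$, and hypothetical helper names. It evaluates $I(X^N; Y^N)$ and the sum $\sum_i I(X^i; Y_i | Y^{i-1})$ from the joint distribution of the BSC(0.5) with $X_i = Y_{i-1}$.

```python
import itertools
import math
from collections import defaultdict

def entropy(pmf):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, key):
    """Marginal pmf of key(x, y) under a joint pmf over (x, y) pairs."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[key(x, y)] += p
    return out

def mutual_information(joint):
    """I(X^N; Y^N) = H(X^N) + H(Y^N) - H(X^N, Y^N)."""
    h_x = entropy(marginal(joint, lambda x, y: x))
    h_y = entropy(marginal(joint, lambda x, y: y))
    return h_x + h_y - entropy(joint)

def directed_information(joint, n):
    """Sum over i of I(X^i; Y_i | Y^{i-1}), written with joint entropies."""
    total = 0.0
    for i in range(1, n + 1):
        h_xi_yprev = entropy(marginal(joint, lambda x, y: (x[:i], y[:i - 1])))
        h_yi = entropy(marginal(joint, lambda x, y: y[:i]))
        h_xi_yi = entropy(marginal(joint, lambda x, y: (x[:i], y[:i])))
        h_yprev = entropy(marginal(joint, lambda x, y: y[:i - 1]))
        total += h_xi_yprev + h_yi - h_xi_yi - h_yprev
    return total

def bsc_half_with_echo(n):
    """Joint pmf of (X^N, Y^N) for a BSC(0.5) with X_1 = 0 and X_i = Y_{i-1}."""
    joint = {}
    for y in itertools.product((0, 1), repeat=n):
        x = (0,) + y[:-1]          # the encoder echoes the fed-back output
        joint[(x, y)] = 0.5 ** n   # Y_i is uniform whatever X_i is
    return joint

N = 4
joint = bsc_half_with_echo(N)
mi = mutual_information(joint)
di = directed_information(joint, N)
print(mi / N, di / N)
```

Under these assumptions the normalized mutual information is $(N-1)/N$, while the sum of conditional terms vanishes, which matches the claim that the feedback, not the channel, accounts for the mutual information.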

In 1989 the directed information appeared in an implicit way in a paper by Cover and Pombra [4]. In an 
intermediate step [4, eq. 52] they showed that the directed information can be used to characterize the capacity of 
additive Gaussian noise channels with feedback. However, the term directed information was coined only a year 
later by Massey in a key paper [5]. 

In [5], Massey introduced directed information, denoted by $I(X^N \to Y^N)$, which he attributes to Marko [6].
Directed information, $I(X^N \to Y^N)$, is defined as:

$$I(X^N \to Y^N) \triangleq \sum_{i=1}^{N} I(X^i; Y_i | Y^{i-1}). \tag{3}$$

Massey showed that directed information is the same as mutual information I(X N ;Y N ) in the absence of feedback 
and it gives a better upper bound on the information that the channel output Y N gives about the source sequence 
in the presence of feedback. 

Tatikonda, in his Ph.D. dissertation [7], generalized the capacity formula of Verdú and Han [8] that deals with
arbitrary single-user channels without feedback to the case of arbitrary single-user channels with feedback by 
using the directed information formula. Recently, the directed information formula was used by Yang, Kavcic and 
Tatikonda [9] and by Chen and Berger [10] to compute the feedback capacity for some special finite-state channels. 

Directed information also appeared recently in a rate distortion problem. Following the competitive prediction 
of Weissman and Merhav [11], Pradhan [12], [13] formulated a problem of source coding with feed-forward and 
showed that directed information can be used to characterize the rate distortion function for the case of feed-forward. 
Another source coding context where directed information has arisen is the recent work by Zamir et al. [14], which
gives a linear prediction representation for the rate distortion function of a stationary Gaussian source.

In this paper we extend the achievability proof given by Gallager in [2] for the case of a finite-state channel
(FSC) without feedback to the case of an FSC with feedback. We find an upper bound on the error of the maximum
likelihood decoder for an FSC with time-invariant deterministic feedback, which allows us to find an achievable
rate for the channel. In addition, we state an upper bound on the capacity of the channel and show that when the
state transition of the FSC does not depend on the input, the achievable rate equals the upper bound and hence
equals the channel capacity. The main contribution of our work is in showing that the directed information, which
was conjectured by Massey [5] to be the capacity of a channel with feedback, is achievable with a random coding
scheme and maximum likelihood decoding, for any time-invariant deterministic feedback.

Fig. 1. Channel with feedback that is a time-invariant deterministic function of the output.

Time-invariant feedback includes the cases of quantized feedback, delayed feedback, and even noisy feedback 
where the noise is known to the encoder. In addition, it allows a unified treatment of capacity analysis for two 
ubiquitous cases: channels without feedback and channels with perfect feedback. These two settings are special cases
of time-invariant feedback: in the first case the time-invariant function of the feedback is the null function and in 
the second case the time-invariant function of the feedback is the identity function. 

The capacity of some channels with channel state information at the receiver and transmitter was derived by 
Caire and Shamai in [15]. Note that if the channel state information can be considered part of the channel output 
and fed back to the transmitter, then this case is a special case of a channel with time-invariant feedback.

The remainder of the paper is organized as follows. Section II defines the channel setting and the notation
throughout the paper. Section III provides a concise summary of the main results of the paper. Section IV introduces
several properties of causal conditioning and directed information that are later used in finding an achievable rate.
Section V provides the proof of achievability of capacity of FSCs with time-invariant feedback. Section VI gives
an upper bound on the capacity. Section VII gives the capacity of an indecomposable FSC without intersymbol
interference (ISI). Section VIII considers the case of FSCs with feedback and side information and shows that if
the state is known both at the encoder and decoder then feedback does not increase the capacity of the channel.
Section IX shows that optimality of source-channel separation holds in the presence of time-invariant feedback. We
conclude in Section X with a summary of this work and some related future directions.

II. Channel Models and Preliminaries

We use subscripts and superscripts to denote vectors in the following way: $x^i = (x_1, \ldots, x_i)$ and
$x_i^j = (x_i, \ldots, x_j)$ for $i \leq j$. For $i \leq 0$, $x^i$ defines the null string, as does $x_i^j$ when
$i > j$. Moreover, we use lower case to denote sample values and upper case to denote random variables. Probability
mass functions are denoted by $P$ or $Q$ when the arguments specify the distribution, e.g.
$P(x|y) = P(X = x | Y = y)$. In this paper, we consider only FSCs. The FSCs are a class of channels rich enough to
include channels with memory, e.g. channels with intersymbol interference. The input of the channel is denoted by
$\{X_1, X_2, \ldots\}$, and the output of the channel is denoted by $\{Y_1, Y_2, \ldots\}$, both taking values in a
finite alphabet. In addition, the channel states take values in a finite set of possible states. The channel is
stationary and is characterized by a conditional probability assignment $P(y_i, s_i | x_i, s_{i-1})$ that satisfies

$$P(y_i, s_i | x^i, s^{i-1}, y^{i-1}) = P(y_i, s_i | x_i, s_{i-1}). \tag{4}$$

An FSC is said to be without intersymbol interference (ISI) if the input sequence does not affect the evolution of
the state sequence, i.e. $P(s_i | s_{i-1}, x_i) = P(s_i | s_{i-1})$.
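To make the definitions concrete, an FSC over small alphabets can be stored as a table of $P(y_i, s_i | x_i, s_{i-1})$, and the no-ISI condition can be tested by marginalizing over the output. The sketch below is illustrative only: the two example channels, their parameters, and all function names are assumptions introduced here, not channels studied in the paper.

```python
from itertools import product

BITS = (0, 1)

def gilbert_elliott_kernel(p_good=0.01, p_bad=0.3, g2b=0.1, b2g=0.2):
    """P(y, s | x, s_prev) for a two-state channel without ISI.

    State 0 is "good" and state 1 is "bad". The state is a Markov chain that
    ignores the input; the crossover probability depends on the previous state.
    """
    trans = {0: {0: 1 - g2b, 1: g2b}, 1: {0: b2g, 1: 1 - b2g}}
    flip = {0: p_good, 1: p_bad}
    kernel = {}
    for x, s_prev, y, s in product(BITS, repeat=4):
        p_y = flip[s_prev] if y != x else 1 - flip[s_prev]
        kernel[(y, s, x, s_prev)] = p_y * trans[s_prev][s]
    return kernel

def input_memory_kernel(eps=0.1):
    """P(y, s | x, s_prev) for a channel with ISI: the new state is the input.

    The output is x XOR s_prev passed through a BSC(eps).
    """
    kernel = {}
    for x, s_prev, y, s in product(BITS, repeat=4):
        p_s = 1.0 if s == x else 0.0
        p_y = 1 - eps if y == (x ^ s_prev) else eps
        kernel[(y, s, x, s_prev)] = p_y * p_s
    return kernel

def is_valid_kernel(kernel, tol=1e-12):
    """Each (x, s_prev) must index a probability distribution over (y, s)."""
    return all(
        abs(sum(kernel[(y, s, x, s_prev)] for y, s in product(BITS, repeat=2)) - 1) < tol
        for x, s_prev in product(BITS, repeat=2)
    )

def is_without_isi(kernel, tol=1e-12):
    """True when P(s | s_prev, x) = sum_y P(y, s | x, s_prev) is the same for every x."""
    for s_prev, s in product(BITS, repeat=2):
        per_input = [sum(kernel[(y, s, x, s_prev)] for y in BITS) for x in BITS]
        if max(per_input) - min(per_input) > tol:
            return False
    return True

ge = gilbert_elliott_kernel()
isi = input_memory_kernel()
print(is_without_isi(ge), is_without_isi(isi))
```

The first channel satisfies the no-ISI condition because its state transitions never look at the input; the second violates it because the input becomes the next state.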

We assume a communication setting that includes feedback as shown in Fig. 1. The transmitter (encoder) knows
at time $i$ the message $m$ and the feedback samples $z^{i-1}$. The output of the encoder at time $i$ is denoted by
$x_i$ and it is a function of the message and the feedback. The channel is an FSC and the output of the channel $y_i$
enters the decoder (receiver). The feedback $z_i$ is a known time-invariant deterministic function of the current
output of the channel $y_i$. For example, $z_i$ could equal $y_i$ or a quantized version of it. The encoder receives
the feedback sample with one unit delay.

Throughout this paper we use the causal conditioning notation $(\cdot \| \cdot)$, which was introduced and employed by
Kramer [16], [17] and by Massey [18]:

$$P(y^N \| x^N) \triangleq \prod_{i=1}^{N} P(y_i | x^i, y^{i-1}). \tag{5}$$

In addition, we introduce the following notation:

$$P(y^N \| x^{N-1}) \triangleq \prod_{i=1}^{N} P(y_i | x^{i-1}, y^{i-1}). \tag{6}$$

The definition given in (6) can be considered to be a particular case of the definition given in (5) where $x_0$ is set
to a dummy zero. This concept was captured by a notation of Massey in [18] via a concatenation at the beginning
of the sequence $x^{N-1}$ with a dummy zero. The directed information $I(X^N \to Y^N)$ is defined in (3) and, by using
the definitions, we can express directed information in terms of causal conditioning as

$$I(X^N \to Y^N) = \sum_{i=1}^{N} I(X^i; Y_i | Y^{i-1}) = \mathbf{E}\left[\log \frac{P(Y^N \| X^N)}{P(Y^N)}\right], \tag{7}$$

where $\mathbf{E}$ denotes expectation. The directed information between $X^N$ and $Y^N$, conditioned on $S$, is denoted
as $I(X^N \to Y^N | S)$ and is defined as:

$$I(X^N \to Y^N | S) \triangleq \sum_{i=1}^{N} I(Y_i; X^i | Y^{i-1}, S). \tag{8}$$

III. Main Results

In this section, we state the main results of the paper.

• Causal conditioning and directed information: In Section IV we establish some properties of causal
conditioning and directed information that are used throughout the proofs, and also provide some intuition
about the meaning of these terms.

• Achievable rate: For any finite-state channel with an initial state denoted by $s_0$, and with the feedback setting
of Fig. 1, and for any $R$, $0 < R < \underline{C}$, where $\underline{C}$ is given by

$$\underline{C} = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1})} \min_{s_0} I(X^N \to Y^N | s_0) \tag{9}$$

(a limit that can be shown to exist), and any $\epsilon > 0$, there exists an $(N, M)$ block code such that for all
messages $m$, $1 \leq m \leq M = \lfloor 2^{NR} \rfloor$, and all initial states, the decoding error is upper bounded by
$\epsilon$. This achievability result is established via analysis of a random coding scheme with maximum likelihood
decoding.

• Converse: For any given channel with the feedback as in Fig. 1, any sequence of $(N, 2^{NR})$ codes with
probability of decoding error that goes to zero as $N \to \infty$ must have

$$R \leq \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1})} I(X^N \to Y^N), \tag{10}$$

where the limit is shown to exist.

• Capacity: For an indecomposable FSC without ISI, the achievable rate and the upper bound are equal. Hence
the capacity, which is defined as the supremum of all achievable rates of the channel, is given by:

$$C = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1})} I(X^N \to Y^N). \tag{11}$$

• State information and feedback: Feedback does not increase the capacity of a strongly connected FSC (every
state can be reached from every other state with positive probability under some input distribution) when the
state of the channel is known both at the encoder and the decoder.

• Source-channel separation: Source-channel coding separation is optimal for any channel with time-invariant
deterministic feedback where the capacity is given by eq. (11).

IV. Properties of causal conditioning and directed information 

In this section we present some properties of the causal conditioning distribution and the directed information,
which are defined in Section II in eqs. (5) and (7). The properties are used throughout the proof of achievability
and also help in gaining some intuition about those definitions and their role in the proof of the achievability.

Lemma 1: Chain rule for causal conditioning. For any random variables $(X^N, Y^N)$,

$$P(x^N, y^N) = P(y^N \| x^N) P(x^N \| y^{N-1}), \tag{12}$$

and, consequently, if $Z^{N-1}$ is a random vector that satisfies $P(x^N \| y^{N-1}) = P(x^N \| z^{N-1})$, then

$$P(x^N, y^N) = P(y^N \| x^N) P(x^N \| z^{N-1}). \tag{13}$$

Proof:

$$P(y^N, x^N) = \prod_{i=1}^{N} P(y_i, x_i | x^{i-1}, y^{i-1}) = \prod_{i=1}^{N} P(y_i | x^i, y^{i-1}) P(x_i | x^{i-1}, y^{i-1}) = P(y^N \| x^N) P(x^N \| y^{N-1}). \tag{14}$$

■

Note that there exists an analogy between this lemma and the chain rule $P(x^N, y^N) = P(y^N | x^N) P(x^N)$. The
analogy between the term $P(y^N | x^N)$ and the term $P(y^N \| x^N)$, and between the term $P(x^N)$ and the term
$P(x^N \| y^{N-1})$, can be helpful for deriving equalities for the case of causal conditioning distributions that are
analogous to the equalities that hold for regular distributions.

Let us define

$$P(y^N \| x^N, s_0) \triangleq \prod_{i=1}^{N} P(y_i | x^i, y^{i-1}, s_0). \tag{15}$$

Lemma 2: For any random variables $(X^N, Y^N, Z^{N-1}, S_0)$ that satisfy $P(x^N \| y^{N-1}, s_0) = P(x^N \| z^{N-1})$,

$$P(x^N, y^N | s_0) = P(y^N \| x^N, s_0) P(x^N \| z^{N-1}). \tag{16}$$

The proof of Lemma 2 is similar to that of Lemma 1 and therefore is omitted.

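Lemma 3: Causal conditioning is in the unit simplex. For any random variables $(X^N, Z^{N-1})$,

$$\sum_{x^N} P(x^N \| z^{N-1}) = 1. \tag{17}$$

Proof:

$$\begin{aligned}
\sum_{x^N} P(x^N \| z^{N-1}) &= \sum_{x_1} \sum_{x_2} \cdots \sum_{x_N} \prod_{i=1}^{N} P(x_i | x^{i-1}, z^{i-1}) \\
&= \sum_{x_1} \sum_{x_2} \cdots \sum_{x_{N-1}} \left( \prod_{i=1}^{N-1} P(x_i | x^{i-1}, z^{i-1}) \right) \sum_{x_N} P(x_N | x^{N-1}, z^{N-1}) \\
&= \sum_{x_1} \sum_{x_2} \cdots \sum_{x_{N-1}} \left( \prod_{i=1}^{N-1} P(x_i | x^{i-1}, z^{i-1}) \right).
\end{aligned} \tag{18}$$

In addition, $\sum_{x_1} P(x_1) = 1$. Hence, by induction, $\sum_{x^N} P(x^N \| z^{N-1}) = 1$. ■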
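Lemma 3 can be illustrated with a quick numerical check. The sketch below assumes binary alphabets, $N = 3$, and randomly drawn conditional distributions; the table layout and function names are hypothetical. For every fixed feedback sequence $z^{N-1}$, summing the causal conditioning over all $x^N$ should give one.

```python
import itertools
import random

def random_conditionals(n, rng):
    """Random values of Q(x_i = 1 | x^{i-1}, z^{i-1}) for binary x and z.

    Keys are (x_prev, z_prev) tuples; both prefixes have length i - 1, so the
    key length identifies the time index.
    """
    table = {}
    for i in range(1, n + 1):
        for x_prev in itertools.product((0, 1), repeat=i - 1):
            for z_prev in itertools.product((0, 1), repeat=i - 1):
                table[(x_prev, z_prev)] = rng.random()
    return table

def causal_conditioning(x, z, table):
    """Q(x^N || z^{N-1}) = prod_i Q(x_i | x^{i-1}, z^{i-1})."""
    prob = 1.0
    for i in range(len(x)):
        p_one = table[(x[:i], z[:i])]
        prob *= p_one if x[i] == 1 else 1 - p_one
    return prob

N = 3
rng = random.Random(0)
table = random_conditionals(N, rng)
sums = {
    z: sum(causal_conditioning(x, z, table) for x in itertools.product((0, 1), repeat=N))
    for z in itertools.product((0, 1), repeat=N - 1)
}
print(sums)
```

Note that the sum is taken over $x^N$ with $z^{N-1}$ held fixed, which is why the result does not require $z^{N-1}$ to have any particular distribution.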
Lemma 4: There is a one-to-one correspondence between causal conditioning $P(x^N \| z^{N-1})$ and the sequence
of conditional distributions $\{P(x_i | x^{i-1}, z^{i-1})\}_{i=1}^{N}$.

Proof: It is obvious that the sequence $\{P(x_i | x^{i-1}, z^{i-1})\}_{i=1}^{N}$ determines the term $P(x^N \| z^{N-1})$.
In the other direction we can use the proof of Lemma 3, in which we showed that $P(x^{N-1} \| z^{N-2})$ is uniquely
determined from $P(x^N \| z^{N-1})$ by a summation over $x_N$. Furthermore, by induction it can be shown that the
sequence $\{P(x^i \| z^{i-1})\}_{i=1}^{N}$ is uniquely derived from $P(x^N \| z^{N-1})$, and then we can use the equality

$$P(x_i | x^{i-1}, z^{i-1}) = \frac{P(x^i \| z^{i-1})}{P(x^{i-1} \| z^{i-2})} \tag{19}$$

to derive uniquely the sequence $\{P(x_i | x^{i-1}, z^{i-1})\}_{i=1}^{N}$. ■

This lemma shows that the maximization in the capacity expressions can be done on the set of sequences
$\{P(x_i | x^{i-1}, z^{i-1})\}_{i=1}^{N}$ or, equivalently, on the set of terms $P(x^N \| z^{N-1})$. The lemma is analogous
to the fact that maximization over the set $P(x^N)$ is equivalent to maximization over the set of sequences
$\{P(x_i | x^{i-1})\}_{i=1}^{N}$.

Lemma 5: Let $X^N$, $Y^N$, $Z^N$ be arbitrary random vectors and $S$ a random variable taking values in an alphabet
of size $|\mathcal{S}|$. Then

$$\left| I(X^N \to Y^N \| Z^{N-1}) - I(X^N \to Y^N \| Z^{N-1}, S) \right| \leq H(S) \leq \log |\mathcal{S}|. \tag{20}$$

In particular, if $Z^N$ is $Y^N$, we get

$$\left| I(X^N \to Y^N) - I(X^N \to Y^N | S) \right| \leq H(S) \leq \log |\mathcal{S}|. \tag{21}$$

This lemma has an important role in the proofs for the capacity of FSCs, because it bounds by a constant the
difference of directed information before and after conditioning on a state. The proof of the lemma is given in
Appendix I.

The proof of the achievable rate for a channel with time-invariant feedback $z_i(y_i)$ is an extension of the proof
of the achievable rate for a channel without feedback given in [2, Ch. 5]. Roughly speaking, in each step we have
to justify replacement of $Q(x^N)$ by $Q(x^N \| z^{N-1})$ and of $P(y^N | x^N)$ by $P(y^N \| x^N)$. The replacement does
not work in all cases; for instance, it does not work in the case of Theorem 4.6.4 in [2]. At the end of the proof we
will see that the achievable rate is the same expression as the mutual information with the probability mass function
$Q(x^N)$ replaced by $Q(x^N \| z^{N-1})$ and $P(y^N | x^N)$ replaced by $P(y^N \| x^N)$. The following lemma shows that
the replacement results in directed information.

Lemma 6: Denote:

$$\mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N)) \triangleq \sum_{y^N, x^N} Q(x^N \| z^{N-1}) P(y^N \| x^N) \log \frac{P(y^N \| x^N)}{\sum_{x'^N} Q(x'^N \| z^{N-1}) P(y^N \| x'^N)}. \tag{22}$$

If $P(x^N \| y^{N-1}) = Q(x^N \| z^{N-1})$, then

$$\mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N)) = I(X^N \to Y^N), \tag{23}$$

and similarly,

$$\mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N, s_0)) = I(X^N \to Y^N | s_0). \tag{24}$$

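Proof:

$$\begin{aligned}
\mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N)) &\triangleq \sum_{y^N, x^N} Q(x^N \| z^{N-1}) P(y^N \| x^N) \log \frac{P(y^N \| x^N)}{\sum_{x'^N} Q(x'^N \| z^{N-1}) P(y^N \| x'^N)} \\
&\stackrel{(a)}{=} \mathbf{E}\left[ \log \frac{P(Y^N \| X^N)}{\sum_{x'^N} Q(x'^N \| z^{N-1}) P(Y^N \| x'^N)} \right] \\
&= \mathbf{E}\left[ \log P(Y^N \| X^N) \right] - \mathbf{E}\left[ \log \sum_{x^N} Q(x^N \| z^{N-1}) P(Y^N \| x^N) \right] \\
&\stackrel{(b)}{=} \mathbf{E}\left[ \log \prod_{i=1}^{N} P(Y_i | X^i, Y^{i-1}) \right] - \mathbf{E}\left[ \log P(Y^N) \right] \\
&= \sum_{i=1}^{N} \left[ \mathbf{E}[\log P(Y_i | X^i, Y^{i-1})] - \mathbf{E}[\log P(Y_i | Y^{i-1})] \right] \\
&= \sum_{i=1}^{N} I(X^i; Y_i | Y^{i-1}).
\end{aligned} \tag{25}$$

Equalities (a) and (b) are due to Lemma 1. ■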

The following lemma is an extension of the conservation law of information given by Massey in [18]. 

Lemma 7: Extended conservation law. For any random variables $(X^N, Y^N, Z^{N-1})$ that satisfy
$P(x^N \| y^{N-1}) = P(x^N \| z^{N-1})$,

$$I(X^N; Y^N) = I(X^N \to Y^N) + I(\{0, Z^{N-1}\} \to X^N), \tag{26}$$

where $\{0, Z^{N-1}\}$ is a concatenation of a dummy zero to the beginning of the sequence $Z^{N-1}$.

Proof:

$$\begin{aligned}
I(X^N; Y^N) &\stackrel{(a)}{=} \mathbf{E}\left[ \log \frac{P(Y^N, X^N)}{P(Y^N) P(X^N)} \right] \\
&\stackrel{(b)}{=} \mathbf{E}\left[ \log \frac{P(Y^N \| X^N) P(X^N \| Z^{N-1})}{P(Y^N) P(X^N)} \right] \\
&= \mathbf{E}\left[ \log \frac{P(Y^N \| X^N)}{P(Y^N)} \right] + \mathbf{E}\left[ \log \frac{P(X^N \| Z^{N-1})}{P(X^N)} \right] \\
&\stackrel{(c)}{=} I(X^N \to Y^N) + I(\{0, Z^{N-1}\} \to X^N).
\end{aligned} \tag{27}$$

Equality (a) is due to the definition of mutual information. Equality (b) is due to Lemma 1 and equality (c) is due
to the definition of directed information. ■
The lemma was proven by induction in [18] for the case where $z_i = y_i$. Here we see that the conservation law
holds more generally when $P(x^N \| y^{N-1}) = P(x^N \| z^{N-1})$. In Subsection V-A we argue that this equality holds
for the setting of deterministic feedback $z_i(y_i)$ and therefore the conservation law holds for the communication
setting given in Fig. 1. This lemma is not used for the proof of achievability; however, it gives a nice intuition for
the relation of directed information and mutual information in the setting of deterministic feedback. In particular,
the lemma implies that the mutual information between the input and the output of the channel is equal to the sum
of directed information in the forward link and the directed information in the backward link. In addition, it is
straightforward to see that in the case of no feedback, i.e. when $z_i$ is null, $I(X^N; Y^N) = I(X^N \to Y^N)$.
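The conservation law can be verified numerically on a small example. The sketch below assumes perfect feedback $z_i = y_i$, a memoryless BSC with crossover probability 0.1, $N = 3$, and an arbitrary illustrative encoder that repeats the last fed-back output with probability 0.8; none of these choices come from the paper. The backward term is computed as $\sum_i I(X_i; Z^{i-1} | X^{i-1})$, which is the directed information from $\{0, Z^{N-1}\}$ to $X^N$.

```python
import math
from collections import defaultdict
from itertools import product

def entropy_of(joint, key):
    """Entropy in bits of key(x, y) when (x, y) is drawn from the joint pmf."""
    pmf = defaultdict(float)
    for (x, y), p in joint.items():
        pmf[key(x, y)] += p
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def build_joint(n, eps=0.1, stay=0.8):
    """Joint pmf of (X^n, Y^n): BSC(eps) with an encoder driven by feedback."""
    joint = {}
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=n):
            p = 1.0
            for i in range(n):
                if i == 0:
                    p *= 0.5                                   # X_1 is a fair bit
                else:
                    p *= stay if x[i] == y[i - 1] else 1 - stay  # encoder uses z_{i-1} = y_{i-1}
                p *= (1 - eps) if y[i] == x[i] else eps        # memoryless BSC
            joint[(x, y)] = p
    return joint

def conservation_terms(joint, n):
    """Return (I(X^n;Y^n), forward directed information, backward directed information)."""
    mi = (entropy_of(joint, lambda x, y: x) + entropy_of(joint, lambda x, y: y)
          - entropy_of(joint, lambda x, y: (x, y)))
    forward = backward = 0.0
    for i in range(1, n + 1):
        # I(X^i; Y_i | Y^{i-1})
        forward += (entropy_of(joint, lambda x, y: (x[:i], y[:i - 1]))
                    + entropy_of(joint, lambda x, y: y[:i])
                    - entropy_of(joint, lambda x, y: (x[:i], y[:i]))
                    - entropy_of(joint, lambda x, y: y[:i - 1]))
        # I(X_i; Y^{i-1} | X^{i-1})
        backward += (entropy_of(joint, lambda x, y: x[:i])
                     + entropy_of(joint, lambda x, y: (x[:i - 1], y[:i - 1]))
                     - entropy_of(joint, lambda x, y: (x[:i], y[:i - 1]))
                     - entropy_of(joint, lambda x, y: x[:i - 1]))
    return mi, forward, backward

N = 3
joint = build_joint(N)
mi, forward, backward = conservation_terms(joint, N)
print(mi, forward, backward)
```

In this example both directed terms are strictly positive, so the mutual information strictly exceeds the forward directed information.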

V. Proof of Achievability 

The proof of the achievable rate of a channel with feedback given here is an extension of the upper bound on
the error of maximum likelihood decoding derived by Gallager in [2, Ch. 5] for FSCs without feedback to the case
of FSCs with feedback. The main difference is that for analyzing the new coding scheme, the feedback $z^{N-1}$ must
be taken into account.

Let us first present a short outline of the proof:

• Encoding scheme. We randomly generate an encoding scheme for blocks of length $N$ by using the causal
conditioning distribution $Q(x^N \| z^{N-1})$.

• Decoding. We assume a maximum likelihood decoder and we denote the error probability when message $m$
is sent and the initial state of the channel is $s_0$ as $P_{e,m}(s_0)$.

• Bounding the error probability. We show that for each $N > N(\epsilon)$, there exists a code for which we can bound
the error probability for all messages $1 \leq m \leq \lceil 2^{NR} \rceil$ and all initial states $s_0$ by the following
exponential:

$$P_{e,m}(s_0) \leq 2^{-N[E_r(R) - \epsilon]}. \tag{28}$$

In addition, we show that if $R < \underline{C}$ then $E_r(R)$ is strictly positive and, hence, by choosing
$\epsilon < E_r(R)$, the probability of error diminishes exponentially for $N > N(\epsilon)$.

A. Random generation of coding scheme 

In the case of no feedback, a coding block of length $N$ is a mapping of each message $m$ to a codeword of length
$N$ and is denoted by $x^N(m)$. In the case of feedback, a coding block is a vector function whose $i$th component is
a function of $m$ and the first $i-1$ components of the received feedback. The mapping of the message $m$ and the
feedback $z^{i-1}$ to the input of the channel $x_i(m, z^{i-1})$ is called a code-tree [19, Ch. 9] or strategy [20].
Figure 2 shows an example of a codeword of length $N = 3$ for the case of no feedback and a code-tree of depth
$N = 3$ for the case of binary feedback.

Randomly chosen coding scheme: We choose the $i$th channel input symbol $x_i(m, z^{i-1})$ of the codeword $m$
by using a probability mass function (PMF) based on previous symbols of the code $x^{i-1}(m, z^{i-2})$ and previous
feedback symbols $z^{i-1}$. The first channel input symbol of codeword $m$ is chosen by the probability function
$Q(x_1)$. The second symbol of codeword $m$ is chosen for all possible feedback observations $z_1$ by the probability
function $Q(x_2 | x_1, z_1)$. The $i$th symbol is chosen for all possible $z^{i-1}$ by the probability function
$Q(x_i | x^{i-1}, z^{i-1})$. This scheme of communication assumes that the probability assignment of $x_i$ given
$x^{i-1}$ and $z^{i-1}$ cannot depend on $y^{i-1}$, because it is unavailable. Therefore

$$P(x_i | x^{i-1}, z^{i-1}(y^{i-1}), y^{i-1}) = P(x_i | x^{i-1}, z^{i-1}(y^{i-1})). \tag{29}$$
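The random generation of a code-tree described above can be sketched as follows. This is an illustrative implementation under assumptions made here: the tree is stored as a dictionary from feedback prefixes $z^{i-1}$ to input symbols, and the conditional PMF $Q(x_i | x^{i-1}, z^{i-1})$ is supplied as a hypothetical callable `q(x_prev, z_prev)` returning a dictionary of probabilities.

```python
import itertools
import random

def draw_code_tree(n, q, rng, feedback_alphabet=(0, 1)):
    """Draw one code-tree of depth n for a single message.

    The tree maps every feedback prefix z^{i-1} (a tuple) to the input x_i.
    q(x_prev, z_prev) returns {symbol: probability}, playing the role of
    Q(x_i | x^{i-1}, z^{i-1}).
    """
    tree = {}
    for i in range(n):
        for z_prev in itertools.product(feedback_alphabet, repeat=i):
            # Inputs already fixed along the path leading to this node.
            x_prev = tuple(tree[z_prev[:k]] for k in range(i))
            pmf = q(x_prev, z_prev)
            symbols, weights = zip(*pmf.items())
            tree[z_prev] = rng.choices(symbols, weights=weights)[0]
    return tree

def encode(tree, z):
    """Channel inputs x^N sent along the feedback path z^{N-1}."""
    return tuple(tree[z[:i]] for i in range(len(z) + 1))

def q_uniform(x_prev, z_prev):
    """Ignore the past: every input symbol is a fair bit."""
    return {0: 0.5, 1: 0.5}

def q_echo(x_prev, z_prev):
    """Degenerate PMF: send 0 first, then repeat the last feedback symbol."""
    return {z_prev[-1]: 1.0} if z_prev else {0: 1.0}

rng = random.Random(0)
random_tree = draw_code_tree(3, q_uniform, rng)
echo_tree = draw_code_tree(3, q_echo, rng)
print(len(random_tree), encode(echo_tree, (1, 0)))
```

A depth-3 tree with binary feedback has $1 + 2 + 4 = 7$ nodes; a full codebook would hold one such tree per message, each drawn independently.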
[Figure 2: left panel, codeword (case of no feedback); right panel, code-tree (case of feedback); both drawn over time indices i = 1, 2, 3.]

Fig. 2. Illustration of coding scheme for setting without feedback and for setting with feedback. In the case of no feedback each message is
mapped to a codeword, and in the case of feedback each message is mapped to a code-tree.

We also define $Q(x^N \| z^{N-1})$, similarly as in (6), to be the causal conditioning probability

$$Q(x^N \| z^{N-1}) = \prod_{i=1}^{N} Q(x_i | x^{i-1}, z^{i-1}). \tag{30}$$

Encoding Scheme: Each message $m$ has a code-tree. Therefore, for any feedback $z^{N-1}$ and message $m$ there is
a unique input $x^N(m, z^{N-1})$ that was chosen randomly as described in the previous paragraph. After choosing the
coding scheme, the decoder is made aware of the code-trees for all possible messages. In our coding scheme the
input $x^N(m, z^{N-1})$ is always a function of the message $m$ and the feedback, but in order to make the equations
shorter we also use the abbreviated notation $x^N$ for $x^N(m, z^{N-1})$.

Decoding Scheme: The decoder in our scheme is the maximum likelihood (ML) decoder. Since the codewords
depend on the feedback, two different messages can have the same codeword for two different outputs; therefore
the regular ML rule $\arg\max_{x^N} P(y^N | x^N)$ cannot be used for decoding the message. Instead, the ML decoder
should be $\arg\max_m P(y^N | m)$, where $N$ is the block length. The following equation shows that finding the most
likely message $m$ can be done by maximizing the causal conditioning $P(y^N \| x^N)$:

$$\arg\max_m \log P(y^N | m) = \arg\max_m \log P(y^N \| x^N). \tag{31}$$

The equality in (31) is shown as follows:

$$\begin{aligned}
P(y^N | m) &= \prod_{i} P(y_i | y^{i-1}, m) \\
&\stackrel{(a)}{=} \prod_{i} P(y_i | y^{i-1}, m, x^i(m, z^{i-1}(y^{i-1}))) \\
&\stackrel{(b)}{=} \prod_{i} P(y_i | y^{i-1}, x^i(m, z^{i-1}(y^{i-1}))) \\
&\stackrel{(c)}{=} P(y^N \| x^N).
\end{aligned} \tag{32}$$

Equality (a) holds because $x^i$ is uniquely determined by the message $m$ and the feedback $z^{i-1}$, and the feedback
$z^{i-1}$ is a deterministic function of $y^{i-1}$. Equality (b) holds because, according to the channel structure, $y_i$
does not depend on $m$ given $x^i$ and $y^{i-1}$. Equality (c) follows from the definition of causal conditioning given
in eq. (5).
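The decoding rule can be illustrated on a toy example. The sketch below makes several assumptions that are not in the paper: the channel is a single-state BSC with crossover probability 0.1, the feedback is perfect ($z_i = y_i$), and there are two messages whose code-trees send $x_i = m \oplus y_{i-1}$ with $y_0 = 0$. The decoder evaluates $P(y^N \| x^N)$ for each message by walking its code-tree along the received outputs.

```python
import itertools

def xor_tree(m, n):
    """Hypothetical code-tree for message m: x_i = m XOR y_{i-1}, with y_0 = 0."""
    tree = {}
    for i in range(n):
        for y_prev in itertools.product((0, 1), repeat=i):
            last = y_prev[-1] if y_prev else 0
            tree[y_prev] = m ^ last
    return tree

def causal_likelihood(tree, y, eps):
    """P(y^N || x^N) for a BSC(eps) when x_i = tree[y^{i-1}] (feedback z_i = y_i)."""
    prob = 1.0
    for i, y_i in enumerate(y):
        x_i = tree[y[:i]]          # the input depends on the message and past outputs
        prob *= (1 - eps) if y_i == x_i else eps
    return prob

def ml_decode(trees, y, eps):
    """Return the message index that maximizes the causal conditioning."""
    return max(range(len(trees)), key=lambda m: causal_likelihood(trees[m], y, eps))

N = 3
EPS = 0.1
trees = [xor_tree(0, N), xor_tree(1, N)]
print(ml_decode(trees, (0, 0, 0), EPS), ml_decode(trees, (1, 0, 1), EPS))
```

In this toy code the two messages can produce the same input symbol at the same time for different output histories (for example $x_2 = 1$ arises from $m = 0, y_1 = 1$ and from $m = 1, y_1 = 0$), which is why the decoder must score messages rather than input sequences.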

B. ML decoding error bound 

The next theorem, which is proved in Appendix II, is a bound on the expected ML decoding error probability
with respect to the random coding. Let $P_{e,m}$, as in [2, Ch. 5.2], denote the probability of error using the ML
decoder when message $m$ is sent. When the source produces message $m$, there is a set of outputs denoted by
$Y_m^c$ that cause an error in decoding the message $m$, i.e.,

$$P_{e,m} = \sum_{y^N \in Y_m^c} P(y^N | m). \tag{33}$$

Theorem 8: Suppose that an arbitrary message $m$, $1 \leq m \leq M$, enters the encoder with feedback and that ML
decoding is employed. Then the average probability of decoding error over this ensemble of codes is bounded, for
any choice of $\rho$, $0 \leq \rho \leq 1$, by

$$\mathbf{E}(P_{e,m}) \leq (M-1)^{\rho} \sum_{y^N} \left[ \sum_{x^N} Q(x^N \| z^{N-1}) P(y^N \| x^N)^{\frac{1}{1+\rho}} \right]^{1+\rho}, \tag{34}$$

where the expectation is with respect to the randomness in the ensemble.

Let us define $P_{e,m}(s_0)$ to be the probability of error given that the initial state of the channel is $s_0$ and
message $m$ was sent. The following theorem, which is proved in Appendix III, establishes the existence of a code
such that $P_{e,m}(s_0)$ is small for all $1 \leq m \leq M$.

Theorem 9: For an arbitrary finite-state channel with $|\mathcal{S}|$ states, for any positive integer $N$ and any
positive $R$, there exists an $(N, M)$ code for which for all messages $m$, $1 \leq m \leq M = \lceil 2^{NR} \rceil$, all
initial states $s_0$, and all $\rho$, $0 \leq \rho \leq 1$, its probability of error is bounded as

$$P_{e,m}(s_0) \leq 4 |\mathcal{S}| \, 2^{-N[-\rho R + F_N(\rho)]}, \tag{35}$$

where

$$F_N(\rho) = -\frac{\rho \log |\mathcal{S}|}{N} + \max_{Q(x^N \| z^{N-1})} \left[ \min_{s_0} E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0) \right], \tag{36}$$

$$E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0) = -\frac{1}{N} \log \sum_{y^N} \left[ \sum_{x^N} Q(x^N \| z^{N-1}) P(y^N \| x^N, s_0)^{\frac{1}{1+\rho}} \right]^{1+\rho}. \tag{37}$$


The following theorem presents a few properties of the function $E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)$, which is defined
in eq. (37), such as positivity of the function and its derivative, and convexity of the function with respect to $\rho$.

Theorem 10: The term $E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)$ has the following properties:

$$E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0) \geq 0; \quad \rho \geq 0. \tag{38}$$

$$\frac{1}{N} \mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N, s_0)) \geq \frac{\partial E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)}{\partial \rho} > 0; \quad \rho \geq 0. \tag{39}$$

$$\frac{\partial^2 E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)}{\partial \rho^2} \leq 0; \quad \rho \geq 0. \tag{40}$$

Furthermore, equality holds in (38) when $\rho = 0$, and equality holds on the left side of eq. (39) when $\rho = 0$.

The proof of the theorem is omitted because it is the same proof as Theorem 5.6.3 in [2]. Theorem 5.6.3 in [2] states
these same properties with $Q(x^N \| z^{N-1})$ and $P(y^N \| x^N)$ replaced by $Q(x^N)$ and $P(y^N | x^N)$, respectively.
The proof of those properties only requires that $\sum_{x^N} Q(x^N \| z^{N-1}) = 1$ and
$\sum_{x^N, y^N} Q(x^N \| z^{N-1}) P(y^N \| x^N, s_0) = 1$, which hold according to Lemmas 3 and 2. By using Lemma 6 we
can substitute $\mathcal{I}(Q(x^N \| z^{N-1}), P(y^N \| x^N, s_0))$ in (39) by the directed information
$I(X^N \to Y^N | s_0)$.
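The properties in Theorem 10 can be observed numerically in the simplest special case. The sketch below assumes $N = 1$, a single-state channel (a BSC with crossover probability 0.1) and no feedback, so that $E_{o,N}$ reduces to Gallager's single-letter function and the causal conditioning reduces to an ordinary input distribution; the uniform input and the function names are illustrative choices. It checks that the function vanishes at $\rho = 0$, that its slope at zero equals the mutual information, and that the resulting exponent $\max_\rho [E_o(\rho) - \rho R]$ is positive only for rates below that mutual information.

```python
import math

def e_o(rho, q, channel):
    """E_o(rho, Q) = -log2 sum_y [ sum_x Q(x) P(y|x)^(1/(1+rho)) ]^(1+rho)."""
    total = 0.0
    for y in range(len(channel[0])):
        inner = sum(q[x] * channel[x][y] ** (1 / (1 + rho)) for x in range(len(q)))
        total += inner ** (1 + rho)
    return -math.log2(total)

def mutual_information(q, channel):
    """I(X;Y) in bits for input pmf q and transition matrix channel[x][y]."""
    info = 0.0
    for y in range(len(channel[0])):
        p_y = sum(q[x] * channel[x][y] for x in range(len(q)))
        for x in range(len(q)):
            joint = q[x] * channel[x][y]
            if joint > 0:
                info += joint * math.log2(channel[x][y] / p_y)
    return info

def e_r(rate, q, channel, grid=200):
    """Grid approximation of max over 0 <= rho <= 1 of E_o(rho) - rho * rate."""
    return max(e_o(k / grid, q, channel) - (k / grid) * rate for k in range(grid + 1))

eps = 0.1
bsc = [[1 - eps, eps], [eps, 1 - eps]]
q = [0.5, 0.5]
h = 1e-6
slope_at_zero = (e_o(h, q, bsc) - e_o(0.0, q, bsc)) / h   # finite-difference derivative
info = mutual_information(q, bsc)
print(slope_at_zero, info, e_r(0.3, q, bsc), e_r(0.6, q, bsc))
```

The grid search is a crude stand-in for the maximization over $\rho$; it is adequate here because the function is concave in $\rho$.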

Lemma 11: Super additivity of $F_N(\rho)$. For any given finite-state channel, $F_N(\rho)$, as given by eq. (36),
satisfies

$$F_N(\rho) \geq \frac{n}{N} F_n(\rho) + \frac{l}{N} F_l(\rho) \tag{41}$$

for all positive integers $n$ and $l$ with $N = n + l$.
The proof of the lemma is given in Appendix IV.

Lemma 12: Convergence of $F_N(\rho)$. Let

$$F_\infty(\rho) = \sup_N F_N(\rho), \tag{42}$$

then

$$\lim_{N \to \infty} F_N(\rho) = F_\infty(\rho), \tag{43}$$

for $0 \leq \rho \leq 1$. Furthermore, the convergence is uniform in $\rho$ and $F_\infty(\rho)$ is uniformly continuous
for $\rho \in [0, 1]$.

Proof: Lemma 4A.2 in [2] states that if a sequence $a_N$ is super additive, i.e.
$a_N \geq \frac{n}{N} a_n + \frac{N-n}{N} a_{N-n}$, then $\lim_{N \to \infty} a_N = \sup_N a_N$. Based on Lemma 11, which
states that $\{F_N(\rho)\}$ is super additive, we get that $F_N(\rho)$ converges to $\sup_N F_N(\rho)$. From Theorem 10
it follows that

$$0 \leq \frac{\partial E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)}{\partial \rho} \leq \frac{1}{N} I(X^N \to Y^N | s_0) \leq \log |\mathcal{Y}|, \tag{44}$$

where $|\mathcal{Y}|$ is the size of the output alphabet. Using this bound with the definition of $F_N$ given in
eq. (36), we can bound the difference of $F_N(\rho)$ for any $0 \leq \rho_1 < \rho_2 \leq 1$ as

$$-\frac{(\rho_2 - \rho_1) \log |\mathcal{S}|}{N} \leq F_N(\rho_2) - F_N(\rho_1) \leq (\rho_2 - \rho_1) \log |\mathcal{Y}|. \tag{45}$$

A consequence of (45) is that the function $F_N(\rho)$ and its slope are bounded independent of $N$ for each
$0 \leq \rho \leq 1$. Therefore the convergence is uniform in $\rho$ and $F_\infty$ is uniformly continuous. ■
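Theorem 13: Let us define

$$\underline{C}_N \triangleq \frac{1}{N} \max_{Q(x^N \| z^{N-1})} \min_{s_0} I(X^N \to Y^N | s_0) \tag{46}$$

and

$$\underline{C} = \lim_{N \to \infty} \underline{C}_N. \tag{47}$$

Then, for a finite-state channel with $|\mathcal{S}|$ states, the limit in (47) exists and

$$\lim_{N \to \infty} \underline{C}_N = \sup_N \left[ \underline{C}_N - \frac{\log |\mathcal{S}|}{N} \right] = \sup_N \underline{C}_N. \tag{48}$$
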



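Proof: Let us divide the input $x^N$ into two sets, $\mathbf{x}_1 = x_1^n$ and $\mathbf{x}_2 = x_{n+1}^N$. Similarly,
let us divide the output $y^N$ into two sets, $\mathbf{y}_1 = y_1^n$ and $\mathbf{y}_2 = y_{n+1}^N$, and let $l = N - n$.
Let $Q_n(\mathbf{x}_1 \| \mathbf{z}_1) = \prod_{i=1}^{n} P(x_i | x^{i-1}, z^{i-1})$ and
$Q_l(\mathbf{x}_2 \| \mathbf{z}_2) = \prod_{i=1}^{l} P(x_{n+i} | x_{n+1}^{n+i-1}, z_{n+1}^{n+i-1})$ be the probability
assignments that achieve $\underline{C}_n$ and $\underline{C}_l$, respectively. Let us consider the probability assignment
$Q(x^N \| z^{N-1}) = Q_n(\mathbf{x}_1 \| \mathbf{z}_1) Q_l(\mathbf{x}_2 \| \mathbf{z}_2)$. Then

$$\begin{aligned}
N \underline{C}_N &\geq \min_{s_0} I(X^N \to Y^N | s_0) \\
&\stackrel{(a)}{=} \min_{s_0} \left[ \sum_{i=1}^{n} I(Y_i; X^i | Y^{i-1}, s_0) + \sum_{i=n+1}^{n+l} I(Y_i; X^i | Y^{i-1}, s_0) \right] \\
&\stackrel{(b)}{\geq} n \underline{C}_n + \min_{s_0} \sum_{i=n+1}^{n+l} I(Y_i; X_{n+1}^{i} | Y_{n+1}^{i-1}, \mathbf{Y}_1, s_0) \\
&\stackrel{(c)}{\geq} n \underline{C}_n + \min_{s_0} \sum_{i=n+1}^{n+l} I(Y_i; X_{n+1}^{i} | Y_{n+1}^{i-1}, \mathbf{Y}_1, S_n, s_0) - \log |\mathcal{S}| \\
&\geq n \underline{C}_n + \min_{s_0} \sum_{s_n} P(s_n | s_0) \sum_{i=n+1}^{n+l} I(Y_i; X_{n+1}^{i} | Y_{n+1}^{i-1}, \mathbf{Y}_1, s_n) - \log |\mathcal{S}| \\
&\geq n \underline{C}_n + \min_{s_n} \sum_{i=n+1}^{n+l} I(Y_i; X_{n+1}^{i} | Y_{n+1}^{i-1}, s_n) - \log |\mathcal{S}| \\
&= n \underline{C}_n + l \underline{C}_l - \log |\mathcal{S}|.
\end{aligned} \tag{49}$$
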

Equality (a) is due to the definition of the directed information. Inequality (b) holds because $\underline{C}_n$ is the
first term, and for the second term we use the fact that $I(X; Y, Z) \geq I(X; Y)$ for any random variables $(X, Y, Z)$.
Inequality (c) is due to Lemma 5. Rearranging the inequality we get:

$$N \left[ \underline{C}_N - \frac{\log |\mathcal{S}|}{N} \right] \geq n \left[ \underline{C}_n - \frac{\log |\mathcal{S}|}{n} \right] + l \left[ \underline{C}_l - \frac{\log |\mathcal{S}|}{l} \right]. \tag{50}$$




Finally, by using the convergence of a super additive sequence, the theorem is proved. ■ 
A rate $R$ is said to be achievable if there exists a sequence of block codes $(N, \lceil 2^{NR} \rceil)$ such that the
maximal probability of error $\max_m P_{e,m}(s_0)$ tends to zero as $N \to \infty$ for all initial states $s_0$ [21]. The
following theorem states that any rate $R$ that satisfies $R < \underline{C}$ is achievable.
Theorem 14: For any given finite-state channel, let

$$E_r(R) = \max_{0 \leq \rho \leq 1} \left[ F_\infty(\rho) - \rho R \right]. \tag{51}$$

Then, for any $\epsilon > 0$, there exists $N(\epsilon)$ such that for $N > N(\epsilon)$ there exists an $(N, M)$ code such
that for all $m$, $1 \leq m \leq M = \lceil 2^{NR} \rceil$, and all initial states,

$$P_{e,m}(s_0) \leq 2^{-N[E_r(R) - \epsilon]}. \tag{52}$$

Furthermore, for $0 \leq R < \underline{C}$, $E_r(R)$ is strictly positive, and therefore the error can be arbitrarily small
for $N$ large enough.

Proof: For any rate $R$, we can rewrite eq. (35) as

$$P_{e,m}(s_0) \leq 2^{-N\left[-\rho R + F_N(\rho) - \frac{\log 4|\mathcal{S}|}{N}\right]}. \tag{53}$$

Because of the uniform convergence in $\rho$ proven in Lemma 12, for all $\epsilon > 0$ there exists an $N(\epsilon)$ that
does not depend on $\rho$ such that, for $N > N(\epsilon)$,

$$\frac{\log 4|\mathcal{S}|}{N} + F_\infty(\rho) - F_N(\rho) \leq \epsilon. \tag{54}$$

Hence, it follows from (53) that

$$P_{e,m}(s_0) \leq 2^{-N[-\rho R + F_\infty(\rho) - \epsilon]}. \tag{55}$$

If we choose the $\rho$ that maximizes $-\rho R + F_\infty(\rho)$ (note that $F_\infty(\rho)$, and therefore
$-\rho R + F_\infty(\rho)$, is continuous in $\rho \in [0, 1]$, so there exists a maximizing $\rho$), then inequality (55)
becomes inequality (52), proving the first part of the theorem.

Now let us show that if $R < \underline{C}$, then $E_r(R) > 0$, which will prove the second part of the theorem. Let us
define $\delta = \underline{C} - R$. According to Theorem 13, $\underline{C}_N$ converges to
$\sup_N \underline{C}_N = \underline{C}$, hence we can choose $N$ large enough so that the following inequality holds:

$$\underline{C}_N > R + \frac{\log |\mathcal{S}|}{N} + \frac{\delta}{2}. \tag{56}$$

From Theorem 10 we have

$$\left. \frac{\partial E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)}{\partial \rho} \right|_{\rho = 0} \geq \underline{C}_N, \quad \forall s_0, \tag{57}$$

where $Q(x^N \| z^{N-1})$ is chosen to be the distribution that achieves $\underline{C}_N$.

Note that $E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0)$ is zero when $\rho = 0$, is a continuous function of $\rho$, and its
derivative at zero with respect to $\rho$ is at least
$\underline{C}_N > R + \frac{\log |\mathcal{S}|}{N} + \frac{\delta}{2}$. Thus, for each state $s_0$ there is a range
$\rho > 0$ such that

$$E_{o,N}(\rho, Q(x^N \| z^{N-1}), s_0) - \rho \left( R + \frac{\log |\mathcal{S}|}{N} \right) > 0. \tag{58}$$

Moreover, because the number of states is finite, there exists a $\rho^* > 0$ for which the inequality (58) is true for
all $s_0$. Thus,

$$F_\infty(\rho^*) \geq F_N(\rho^*) \geq E_{o,N}(\rho^*, Q(x^N \| z^{N-1}), s_0) - \rho^* \frac{\log |\mathcal{S}|}{N} > \rho^* R, \quad \forall s_0, \tag{59}$$

and thus $E_r(R) > 0$ for $R < \underline{C}$. ■



C. Feedback that is a deterministic function of a finite tuple of the output 

We proved Theorem 2 for the case when the feedback $z_i$ is a deterministic function of the output at time $i$, i.e.
$z_i = z(y_i)$. We now extend the theorem to the case where the feedback is a deterministic function of a finite tuple
of the output, i.e. $z_i = z(y_{i-D+1}, \ldots, y_i)$.

Consider the case $D = 2$. Let us construct a new finite state channel, with input $x_i$, and output $\tilde{y}_i$ that is the
tuple $\{y_i, y_{i-1}\}$. The state of the new channel $\tilde{s}_i$ is the tuple $\{s_i, y_i\}$.

Let us verify that the definition of a FSC holds for the new channel:

$$\begin{aligned}
P(\tilde{y}_i, \tilde{s}_i | \tilde{y}^{i-1}, \tilde{s}^{i-1}, x^i) &= P(y_i, y_{i-1}, s_i, y_i | y^{i-1}, s^{i-1}, y^{i-1}, x^i) \\
&= P(y_i, y_{i-1}, s_i | y_{i-1}, s_{i-1}, x_i) \\
&= P(\tilde{y}_i, \tilde{s}_i | \tilde{s}_{i-1}, x_i).
\end{aligned} \tag{60}$$

Both channels are equivalent, and because the feedback $z_i$ is a deterministic function of the output of the new
channel, $\tilde{y}_i$, we can apply Theorem 2 and get that any achievable rate satisfies

$$\begin{aligned}
R < \underline{C}_N &= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{\tilde{s}_0} I(X^N \to \tilde{Y}^N | \tilde{s}_0) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_0} I(X^N \to \{Y^N, Y_0^{N-1}\} | s_0, y_0) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_0} \sum_{i=1}^N I(X^i; Y_i, Y_{i-1} | Y^{i-1}, Y_0^{i-2}, s_0, y_0) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_0} \sum_{i=1}^N H(Y_i, Y_{i-1} | Y^{i-1}, Y_0^{i-2}, s_0, y_0) - H(Y_i, Y_{i-1} | Y^{i-1}, Y_0^{i-2}, X^i, s_0, y_0) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_0} \sum_{i=1}^N H(Y_i | Y^{i-1}, s_0, y_0) - H(Y_i | Y^{i-1}, X^i, s_0, y_0) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_0} I(X^N \to Y^N | s_0, y_0).
\end{aligned} \tag{61}$$

This result can be extended by induction to the general case where the feedback $z_i$ depends on a tuple of $D$
outputs, leading to the achievability of any rate smaller than $\lim_{N\to\infty} \underline{C}_N$, where in this setting

$$\underline{C}_N = \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0, y_{2-D}, \ldots, y_0} I(X^N \to Y^N | s_0, y_{2-D}, \ldots, y_0). \tag{62}$$

VI. Upper bound on the feedback capacity 

Theorem 15: The capacity of a channel where the input is $x^N$ and the output is $y^N$ and the channel has a time-invariant
deterministic feedback, as presented in Fig. 2, is upper bounded as

$$C_{FB} \le \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N). \tag{63}$$

Proof: Let $W$ be the message, chosen according to a uniform distribution $\Pr(W = w) = 2^{-NR}$. The input
to the channel $x_i$ is a function of the message $W$ and the arbitrary deterministic feedback output $z^{i-1}(y^{i-1})$. We
have

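$$\begin{aligned}
NR &= H(W) \\
&= I(W; Y^N) + H(W | Y^N) \\
&\overset{(a)}{\le} I(Y^N; W) + 1 + P_e^{(N)} NR \\
&= H(Y^N) - H(Y^N | W) + 1 + P_e^{(N)} NR \\
&\overset{(b)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - \sum_{i=1}^N H(Y_i | W, Y^{i-1}) + 1 + P_e^{(N)} NR \\
&\overset{(c)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - \sum_{i=1}^N H(Y_i | W, Y^{i-1}, X^i(W, z^{i-1}(Y^{i-1}))) + 1 + P_e^{(N)} NR \\
&\overset{(d)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - \sum_{i=1}^N H(Y_i | Y^{i-1}, X^i) + 1 + P_e^{(N)} NR \\
&= \sum_{i=1}^N I(Y_i; X^i | Y^{i-1}) + 1 + P_e^{(N)} NR.
\end{aligned} \tag{64}$$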

Inequality (a) holds because of Fano's inequality. Equality (b) holds because of the chain rule. Equality (c) holds
because $X_i$ is a deterministic function given the message $W$ and the feedback $z^{i-1}$, where the feedback $z^{i-1}$ is
a deterministic function of the output. Equality (d) holds because the random variables $W, X^i, Y^{i-1}, Y_i$ form the
Markov chain $W - (X^i, Y^{i-1}) - Y_i$. By dividing both sides of the equation by $N$, maximizing over all possible
input distributions, and letting $N \to \infty$ we get that in order to have an error probability arbitrarily small, the rate
$R$ must satisfy:

$$R \le \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \sum_{i=1}^N I(Y_i; X^i | Y^{i-1}) = \lim_{N\to\infty} \max_{Q(x^N\|z^{N-1})} \frac{1}{N} I(X^N \to Y^N). \tag{65}$$

This completes the proof. ■
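The sum $\sum_i I(Y_i; X^i | Y^{i-1})$ appearing in (64) can be evaluated directly from a joint distribution of short input and output sequences. The sketch below is an illustration only: the joint pmf is represented as a dictionary keyed by `(x_tuple, y_tuple)`, and the toy example (a channel whose output is pure noise, with the encoder copying the fed-back output into its next input) is an assumed example, not one taken from the paper. It shows a case where the mutual information is one bit while the directed information is zero.

```python
import math
from collections import defaultdict

def entropy(pmf):
    # Shannon entropy in bits of a pmf given as {outcome: probability}.
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, key):
    # Marginalize a joint pmf {(x, y): p} onto key(x, y).
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[key(x, y)] += p
    return out

def directed_information(joint, n):
    # I(X^n -> Y^n) = sum_i [H(Y_i | Y^{i-1}) - H(Y_i | Y^{i-1}, X^i)].
    total = 0.0
    for i in range(1, n + 1):
        h_y_now = entropy(marginal(joint, lambda x, y: y[:i]))
        h_y_prev = entropy(marginal(joint, lambda x, y: y[:i - 1]))
        h_xy_now = entropy(marginal(joint, lambda x, y: (x[:i], y[:i])))
        h_xy_prev = entropy(marginal(joint, lambda x, y: (x[:i], y[:i - 1])))
        total += (h_y_now - h_y_prev) - (h_xy_now - h_xy_prev)
    return total

def mutual_information(joint):
    # I(X^n; Y^n) = H(X^n) + H(Y^n) - H(X^n, Y^n).
    h_x = entropy(marginal(joint, lambda x, y: x))
    h_y = entropy(marginal(joint, lambda x, y: y))
    return h_x + h_y - entropy(joint)

def noise_channel_with_copy_feedback():
    # Assumed toy example: Y_1, Y_2 are fair coin flips independent of the input,
    # and the encoder sends X_1 = 0, X_2 = Y_1 (the fed-back output).
    return {((0, y1), (y1, y2)): 0.25 for y1 in (0, 1) for y2 in (0, 1)}

if __name__ == "__main__":
    joint = noise_channel_with_copy_feedback()
    print(mutual_information(joint), directed_information(joint, 2))
```

In this toy example the dependence between $X^2$ and $Y^2$ is created entirely by the feedback, so it contributes to $I(X^N; Y^N)$ but not to $I(X^N \to Y^N)$.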
Remark: The converse proof is with respect to the average error over all messages. This, of course, implies that
it is also true with respect to the maximum error over all messages. In the achievability part we proved that the
maximum error over all messages goes to zero when $R < \underline{C}$ which, of course, also implies that the average error
goes to zero. Hence, both the achievability and the converse are true with respect to average error probability and
maximum error probability over all messages.

VII. Indecomposable FSC without ISI 

In this section we assume that the channel states evolve according to a Markov chain which does not depend
on the input, namely $P(y_i, s_i | s_{i-1}, x_i) = P(s_i | s_{i-1}) P(y_i | s_i, s_{i-1}, x_i)$. In addition, we assume that the Markov
chain is indecomposable. Such a channel is called a Finite State Markovian indecomposable channel (FSMIC) in
[22]; however, another suitable name, which we adopt henceforth, is a FSC without ISI. The difference between this
channel and the indecomposable FSC defined in [2], [3] is that here we make an additional assumption that the
transition probability between states is not a function of the input.

A Markov chain with transition matrix $P(i, j)$ is indecomposable if it contains only one ergodic class [23]. An
equivalent definition is that the effect of the initial state of the Markov chain dies away with time. More precisely:

Definition 1: A Markov chain is indecomposable if, for every $\epsilon > 0$, there exists an $N_0$ such that for $N \ge N_0$,

$$|P(s_N | s_0) - P(s_N | s_0')| \le \epsilon \tag{66}$$

for all $s_N$, $s_0$, $s_0'$.
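The condition in Definition 1 can be checked numerically for a given transition matrix. The sketch below is a minimal illustration in plain Python; the function names and the two example matrices are assumptions made for the illustration. It computes the largest gap $|P(s_N | s_0) - P(s_N | s_0')|$ over all triples of states after $N$ steps.

```python
def mat_mul(a, b):
    # Product of two square matrices given as lists of rows.
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]

def n_step_matrix(p, n):
    # n-step transition matrix: entry [s0][sN] is P(s_N | s_0).
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

def max_initial_state_gap(p, n):
    # max over s_N, s_0, s_0' of |P(s_N | s_0) - P(s_N | s_0')|.
    pn = n_step_matrix(p, n)
    size = len(p)
    return max(
        abs(pn[a][s] - pn[b][s])
        for a in range(size)
        for b in range(size)
        for s in range(size)
    )

if __name__ == "__main__":
    mixing = [[0.9, 0.1], [0.2, 0.8]]    # hypothetical indecomposable chain
    frozen = [[1.0, 0.0], [0.0, 1.0]]    # two ergodic classes: not indecomposable
    for n in (1, 10, 50):
        print(n, max_initial_state_gap(mixing, n), max_initial_state_gap(frozen, n))
```

For the first matrix the gap shrinks geometrically with $N$, while for the identity matrix it stays equal to one, so no $N_0$ satisfies (66) for small $\epsilon$.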

In this section we prove that for a FSC without ISI the achievable rate does not depend on the initial state $s_0$
and therefore the lower bound and the upper bound on the capacity as given in (47) and (63) are equal.
Let us define

$$\overline{C}_N = \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \max_{s_0} I(X^N \to Y^N | s_0) \tag{67}$$

and

$$\overline{C} = \lim_{N\to\infty} \overline{C}_N, \tag{68}$$

a limit that will be shown to exist. In addition let us define

$$C = \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N). \tag{69}$$

Theorem 16: For a FSC without ISI,

$$\underline{C} = \overline{C} = C = \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N), \tag{70}$$

where $\overline{C}$ was defined in (68) and $\underline{C}$ in (47).

Proof: For arbitrary $N$, let $Q_N(x^N\|z^{N-1})$ and $s_0'$ be the input distribution and the initial state that maximize
$I_Q(X^N \to Y^N | s_0)$, and let $s_0''$ denote the initial state that minimizes $I_Q(X^N \to Y^N | s_0)$ for the same input
distribution, where the subscript in $I_Q$ is added to emphasize its dependence on $Q_N$, though we suppress the
subscript $N$ from $Q_N$. Thus, we have

$$\frac{1}{N} I_Q(X^N \to Y^N | s_0') = \overline{C}_N \ge \underline{C}_N \ge \frac{1}{N} I_Q(X^N \to Y^N | s_0''). \tag{71}$$

The inequality holds due to the definitions of $\overline{C}_N$ and $\underline{C}_N$. Next, we will prove that $\lim_{N\to\infty} \frac{1}{N} I_Q(X^N \to Y^N | s_0') = \lim_{N\to\infty} \frac{1}{N} I_Q(X^N \to Y^N | s_0'')$ and therefore $\overline{C} = \underline{C}$.

Let $n + l = N$, where $n$ and $l$ are positive integers and let $s_0'$ and $s_0''$ be any two initial states. Let the random
variable $S_n$ be the state at time $n$. We would like to emphasize that the difference in the letter case notation is
because $s_0'$ and $s_0''$ are specific states while $S_n$ is a random variable. Then

$$\begin{aligned}
\frac{1}{N} I_Q(X^N \to Y^N | s_0')
&\overset{(a)}{\le} \frac{1}{N}\left[\log|\mathcal{S}| + I_Q(X^N \to Y^N | s_0', S_n)\right] \\
&= \frac{1}{N}\left[\log|\mathcal{S}| + \sum_{i=1}^{n} I_Q(Y_i; X^i | Y^{i-1}, S_n, s_0') + \sum_{i=n+1}^{N} I_Q(Y_i; X^n, X_{n+1}^i | Y^{i-1}, S_n, s_0')\right] \\
&\overset{(b)}{\le} \frac{1}{N}\left[\log|\mathcal{S}| + n\log|\mathcal{Y}| + \sum_{i=n+1}^{N} I_Q(Y_i; X^n, X_{n+1}^i | Y^{i-1}, S_n, s_0')\right] \\
&= \frac{1}{N}\left[\log|\mathcal{S}| + n\log|\mathcal{Y}| + \sum_{i=n+1}^{N} \left[I_Q(Y_i; X_{n+1}^i | Y^{i-1}, S_n, s_0') + I_Q(Y_i; X^n | Y^{i-1}, X_{n+1}^i, S_n, s_0')\right]\right] \\
&\overset{(c)}{=} \frac{1}{N}\left[\log|\mathcal{S}| + n\log|\mathcal{Y}| + \sum_{i=n+1}^{N} I_Q(Y_i; X_{n+1}^i | Y^{i-1}, S_n, s_0')\right].
\end{aligned} \tag{72}$$

Inequality (a) is due to Lemma 5. Inequality (b) is due to the bound $I_Q(Y_i; X^i | Y^{i-1}, S_n, s_0') \le \log|\mathcal{Y}|$. Equality
(c) holds because given the state $S_n$ and the input after time $n$, the output after time $n$ does not depend on the
input before time $n$, i.e. $P(y_i | y^{i-1}, x_{n+1}^i, s_n, x^n, s_0) = P(y_i | y^{i-1}, x_{n+1}^i, s_n, s_0)$, $i > n$. By using inequality (72)
we can bound the difference between the directed information starting at two different states:



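$$\frac{1}{N}\left|I_Q(X^N \to Y^N | s_0') - I_Q(X^N \to Y^N | s_0'')\right| \le \frac{1}{N}\left[\log|\mathcal{S}| + n\log|\mathcal{Y}| + \sum_{i=n+1}^{N}\left[I_Q(Y_i; X_{n+1}^i | Y^{i-1}, S_n, s_0') - I_Q(Y_i; X_{n+1}^i | Y^{i-1}, S_n, s_0'')\right]\right]. \tag{73}$$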



The sum in the last inequality can be bounded by using the indecomposability property of the Markov chain. For 
every i > n we have: 

Iq^X^Y*- 1 , s n , 4) - I Q (Y i; X^Y'-^Sn, 4) 

= 53[P( s „|/ )-^(^l4')]^m;^ + il^ i -- 1 1 , s „,4) 

8 n 

(b) 



< 5][ p ( s "' s o) - p ( s n\4)} log \y\ 



< e n log \y\ 



(74) 



Equality (a) is achieved by summing over all possible states $s_n$. Inequality (b) is achieved by bounding the magnitude
of each term in the sum by $\log|\mathcal{Y}|$. Inequality (c) holds by defining $\epsilon_n = \max_{s_0', s_0''} \sum_{s_n} |P(s_n | s_0') - P(s_n | s_0'')|$.
Combining eq. (73) and eq. (74) we obtain:

$$\begin{aligned}
\frac{1}{N}\left|I_Q(X^N \to Y^N | s_0') - I_Q(X^N \to Y^N | s_0'')\right| &\le \frac{1}{N}\left[\log|\mathcal{S}| + n\log|\mathcal{Y}| + \epsilon_n \cdot l \log|\mathcal{Y}|\right] \\
&\le \frac{1}{N}\log|\mathcal{S}| + \frac{n}{N}\log|\mathcal{Y}| + \epsilon_n \log|\mathcal{Y}|.
\end{aligned}$$

Since, by the indecomposability of the channel, $\epsilon_n \to 0$ as $n \to \infty$, and since the preceding inequality holds for all $0 \le n \le N$, it
follows (by letting $n$ increase without bound, but sub-linearly in $N$) that

$$\lim_{N\to\infty} \frac{1}{N}\left|I_Q(X^N \to Y^N | s_0') - I_Q(X^N \to Y^N | s_0'')\right| = 0. \tag{75}$$

Up to now, we have proved that $\lim_{N\to\infty} \underline{C}_N = \lim_{N\to\infty} \overline{C}_N$; this follows from eq. (71) and eq. (75). Finally,
we show that even without conditioning on $s_0$ we get the same limit. Indeed,

$$\begin{aligned}
C_N \triangleq \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N) &\overset{(a)}{\le} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N | S_0) + \frac{\log|\mathcal{S}|}{N} \\
&\le \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \max_{s_0} I(X^N \to Y^N | s_0) + \frac{\log|\mathcal{S}|}{N} \\
&= \overline{C}_N + \frac{\log|\mathcal{S}|}{N}.
\end{aligned} \tag{76}$$

Inequality (a) holds because, according to Lemma 5, the magnitude of the difference between the expressions on the
two sides is bounded by $\frac{\log|\mathcal{S}|}{N}$. In a similar way we prove that $C_N \ge \underline{C}_N - \frac{\log|\mathcal{S}|}{N}$ and therefore we get
that $\lim_{N\to\infty} \underline{C}_N = \lim_{N\to\infty} C_N = \lim_{N\to\infty} \overline{C}_N$, which concludes the proof. ■

The capacity of a channel is defined as the supremum over all achievable rates, analogous to what is done in 
the absence of feedback [21]. 

Theorem 17: The capacity of an indecomposable FSC without ISI with a time-invariant feedback is given
by

$$C = \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N), \tag{77}$$

where $C$ denotes the capacity of the channel in the presence of feedback.

Proof: According to Theorem 2, for any given finite state channel, any rate $R$ in the range
$0 \le R < \underline{C}$ is achievable. According to Theorem 15, the upper bound on capacity of a FSC is
$\lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N)$. Hence we get that the capacity $C$ is bounded from below and from
above by:

$$\underline{C} \le C \le \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N). \tag{78}$$

Theorem 16 states that, for an indecomposable FSC without ISI, the upper bound equals the lower bound, i.e.
$\underline{C} = \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N)$, and therefore the capacity is given by (77). ■
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Fig. 3. Channel with feedback and side information £j. 



VIII. Feedback and side information 

The results of the previous sections can be extended to the case where side information is available at the decoder
that might also be fed back to the encoder. Let $l_i$ be the side information available at the decoder and the setting
of communication the one in Fig. 3. If the side information $l_i$ satisfies

$$P(l_i, y_i, s_i | s^{i-1}, x^i, y^{i-1}, l^{i-1}) = P(l_i, y_i, s_i | s_{i-1}, x_i) \tag{79}$$

then it follows that

$$P(\tilde{y}_i, s_i | s^{i-1}, x^i, \tilde{y}^{i-1}) = P(\tilde{y}_i, s_i | s_{i-1}, x_i), \tag{80}$$

where $\tilde{y}_i = (l_i, y_i)$. We can now apply Theorem 14 and get:

$$\underline{C}_N = \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \min_{s_0} I(X^N \to \{Y^N, L^N\} | s_0), \tag{81}$$

where $z_{i-1}$ denotes the feedback available at the receiver at time $i$ which is a time-invariant function of $l_{i-1}$ and
$y_{i-1}$.

While many cases of side information can be studied, we are going to consider only the case in which the side
information is the state of the channel, i.e. $l_i = s_i$, which is fed back to the encoder, namely we let $z_i(y_i, l_i) = s_i$.
In this section we no longer assume that there is no ISI; instead we assume that the FSC is strongly connected,
which we define as follows.

Definition 2: We say that a finite state channel is strongly connected if there exists an input distribution
$\{Q(x_t | s_{t-1})\}_{t \ge 1}$ and integer $T$ such that

$$\Pr\{S_t = s \text{ for some } 1 \le t \le T \,|\, S_0 = s'\} > 0, \quad \forall s', s. \tag{82}$$
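One way to test Definition 2 for a concrete channel is to look at which state transitions are possible under at least one input. The sketch below rests on an assumption made for illustration: it takes an input distribution with full support, so that a transition $s \to t$ has positive probability whenever some input $x$ gives $P(t | s, x) > 0$, and then checks that every state can be reached from every state in one or more steps. The data layout `state_kernel[s][x][t]` and the function names are hypothetical.

```python
def reachable_in_one_or_more_steps(edges, start):
    # States reachable from `start` by a path of length at least one.
    seen = set()
    frontier = list(edges[start])
    while frontier:
        state = frontier.pop()
        if state not in seen:
            seen.add(state)
            frontier.extend(edges[state])
    return seen

def is_strongly_connected(state_kernel):
    # state_kernel[s][x][t] = P(next state t | current state s, input x).
    num_states = len(state_kernel)
    edges = {
        s: {t for row in state_kernel[s] for t, prob in enumerate(row) if prob > 0}
        for s in range(num_states)
    }
    everything = set(range(num_states))
    return all(
        reachable_in_one_or_more_steps(edges, s) == everything
        for s in range(num_states)
    )

if __name__ == "__main__":
    # Hypothetical two-state examples: input 0 keeps the state, input 1 flips it.
    flip = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    # State 1 is absorbing for every input, so state 0 cannot be reached from it.
    trap = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
    print(is_strongly_connected(flip), is_strongly_connected(trap))
```

The check only uses the support of the state transition law, so it says nothing about how large $T$ must be or how fast the target state is reached.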
Theorem 18: Feedback does not increase the capacity of a strongly connected FSC when the state of the channel
is known both at the encoder and the decoder. Furthermore, the capacity of the channel under this setting is given
by

$$C = \lim_{N\to\infty} \frac{1}{N} \max_{\{Q(x_i | s_{i-1})\}_{i=1}^N} \sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1}). \tag{83}$$

A straightforward consequence of this theorem is that feedback does not increase the capacity of a discrete 
memoryless channel (DMC), which Shannon proved in 1956 [1]. A DMC can be considered as an FSC with 
only one state, and therefore the state of the channel is known to the encoder and the decoder. 

Proof: First, we notice that because the state of the channel is known both to the encoder and the decoder, and
because the FSC is strongly connected, we can assume that with probability $1 - \epsilon$, where $\epsilon$ is arbitrarily small, the
FSC can be driven, in a finite time, to the state that maximizes the achievable rate. Hence, the achievable
rate does not depend on the initial state and the capacity of the channel in the presence of feedback, which we denote
as $C^{(F)}$, is given by $\lim_{N\to\infty} C_N^{(F)}$, where $C_N^{(F)}$ satisfies

$$\begin{aligned}
C_N^{(F)} &= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to \{Y^N, L^N\}) \\
&\overset{(a)}{=} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \sum_{i=1}^N I(X^i; Y_i, S_i | Y^{i-1}, S^{i-1}) \\
&\overset{(b)}{=} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \sum_{i=1}^N H(Y_i, S_i | Y^{i-1}, S^{i-1}) - H(Y_i, S_i | S_{i-1}, X_i) \\
&\overset{(c)}{\le} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \sum_{i=1}^N H(Y_i, S_i | S_{i-1}) - H(Y_i, S_i | S_{i-1}, X_i) \\
&= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} \sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1}) \\
&\overset{(d)}{=} \frac{1}{N} \max_{\{Q(x_i | s_{i-1})\}_{i=1}^N} \sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1}).
\end{aligned} \tag{84}$$
Equality (a) follows by replacing $L_i$ with $S_i$ according to the communication setting. Equality (b) follows from the
FSC property. Inequality (c) holds because conditioning reduces entropy. Equality (d) holds because maximizing
over the set of causal conditioning probabilities $Q(x^N\|z^{N-1})$ is the same as maximizing over the set of probabilities
$\{Q(x_i | s_{i-1})\}_{i=1}^N$, as shown in the following argument. The sum $\sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1})$ is determined uniquely by
the sequence of probabilities $\{P(y_i, s_i, x_i, s_{i-1})\}_{i=1}^N$. Let us prove by induction that this sequence of probabilities
is determined by $\{Q(x_i | x^{i-1}, y^{i-1}, s^{i-1})\}_{i=1}^N$ only through $\{Q(x_i | s_{i-1})\}_{i=1}^N$. For $i = 1$ we have

$$P(y_1, s_1, x_1, s_0) = P(s_0) Q(x_1 | s_0) P(y_1, s_1 | x_1, s_0). \tag{85}$$

Since $P(s_0)$ and $P(y_1, s_1 | x_1, s_0)$ are determined by the channel properties, the input distribution to the channel
can influence only the term $Q(x_1 | s_0)$. Now, let us assume that the argument is true for $i - 1$ and let us prove it for
$i$.

$$P(y_i, s_i, x_i, s_{i-1}) = P(s_{i-1}) Q(x_i | s_{i-1}) P(y_i, s_i | x_i, s_{i-1}). \tag{86}$$

The term $P(s_{i-1})$ is the same under both sequences of probabilities because of the assumption that the argument
holds for $i - 1$. The term $P(y_i, s_i | x_i, s_{i-1})$ is determined by the channel, so the only term influenced by the input
distribution is $Q(x_i | s_{i-1})$. This proves the validity of the argument for all $i$ and, consequently, the equality (d).

Inequality (84) proves that the achievable rate, when there is feedback and state information, cannot exceed
$\lim_{N\to\infty} \frac{1}{N} \max_{\{Q(x_i | s_{i-1})\}_{i=1}^N} \sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1})$. Now let us prove that if the state of the channel is known
at the encoder and the decoder and there is no feedback, we can achieve this rate. For this setting we denote the
capacity as $C^{(NF)}$ and, as in the case of feedback, the capacity does not depend on the initial state and is given as
$\lim_{N\to\infty} C_N^{(NF)}$, where $C_N^{(NF)}$ satisfies

$$\begin{aligned}
C_N^{(NF)} &= \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to \{Y^N, L^N\}) \\
&\overset{(a)}{=} \frac{1}{N} \max_{Q(x^N\|s^{N-1})} \sum_{i=1}^N H(Y_i, S_i | Y^{i-1}, S^{i-1}) - H(Y_i, S_i | Y^{i-1}, S^{i-1}, X^i) \\
&\overset{(b)}{=} \frac{1}{N} \max_{Q(x^N\|s^{N-1})} \sum_{i=1}^N H(Y_i, S_i | Y^{i-1}, S^{i-1}) - H(Y_i, S_i | S_{i-1}, X_i) \\
&\overset{(c)}{\ge} \frac{1}{N} \max_{\{Q(x_i | s_{i-1})\}_{i=1}^N} \sum_{i=1}^N H(Y_i, S_i | Y^{i-1}, S^{i-1}) - H(Y_i, S_i | S_{i-1}, X_i) \\
&\overset{(d)}{=} \frac{1}{N} \max_{\{Q(x_i | s_{i-1})\}_{i=1}^N} \sum_{i=1}^N I(Y_i, S_i; X_i | S_{i-1}) \\
&\overset{(e)}{\ge} C_N^{(F)}.
\end{aligned} \tag{87}$$

Equality (a) follows by replacing $L_i$ and $Z_i$ with $S_i$ according to the communication setting. Equality (b)
follows from the FSC property. Inequality (c) holds because we restrict the range of probabilities over which
the maximization is performed. Equality (d) holds because under an input distribution $Q(x_i | s_{i-1})$, we have the
following Markov chain: $(Y_i, S_i) - S_{i-1} - (Y^{i-1}, S^{i-2})$. Inequality (e) holds due to (84).

Taking the limit $N \to \infty$ on both sides of (87) shows that $C^{(NF)} \ge C^{(F)}$. Since trivially also $C^{(NF)} \le C^{(F)}$,
we have $C^{(NF)} = C^{(F)}$. ■

IX. Source-channel separation 

In this section we prove the optimality of source-channel separation for the case of an ergodic source that
is transmitted through a channel with a deterministic time-invariant feedback. Namely, we prove that in the
communication setting presented in Fig. 5 the number of bits per channel use that can be transmitted and
reconstructed within a given distortion is the same as in the communication setting of Fig. 4.

Let us state the source-channel separation theorem as presented in [24, Chapter 7].

Theorem 19: Let $\epsilon > 0$ and $D \ge 0$ be given. Let $R(\cdot)$ be the rate distortion function of a discrete, stationary,
ergodic source with respect to a single letter criterion generated by a bounded distortion measure $\rho$. Then the source
output can be reproduced with fidelity $D$ at the receiving end of any channel if $C > R(D)$. Conversely, fidelity $D$
is unattainable at the receiving end of any channel of capacity $C < R(D)$.

Fig. 5. Source and channel separation.

Remark: For the simplicity of the presentation we assumed one channel use per source symbol. Our derivation 
below extends to the general case where the average number of channel uses per letter is 2*, analogously as in [2, 
chapter 9]. 

The purpose of this section is to prove the theorem for a channel with time-invariant feedback, as shown in Fig. 
13 for the cases where its capacity is given by 

$$C = \lim_{N\to\infty} \frac{1}{N} \max_{Q(x^N\|z^{N-1})} I(X^N \to Y^N). \tag{88}$$

In the case of no feedback the proof of separation optimality is based on the data processing inequality, which states
that $I(U^N; V^N) \le I(X^N; Y^N)$ because of the Markov chain $U^N - X^N - Y^N - V^N$. However, the regular data
processing inequality does not hold for the directed information and therefore an explicit derivation of the inequality
$I(U^N; V^N) \le I(X^N \to Y^N)$ is needed.

Proof: The direct proof, namely that if $C > R(D)$ it is possible to
reproduce the source with fidelity $D$, is the same as for the case without feedback [24, Theorem 7.2.6].

For the converse, namely that $R(D)$ has to be less than or equal to $C$, we use the fact that for any $i$, the Markov chain
$U^N - X_i(U^N, Y^{i-1}) - Y_i$ holds.



$$\begin{aligned}
NR(D) &\overset{(a)}{\le} I(U^N; V^N) \\
&\overset{(b)}{\le} I(U^N; Y^N) \\
&\overset{(c)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - H(Y_i | U^N, Y^{i-1}) \\
&\overset{(d)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - H(Y_i | U^N, Y^{i-1}, X^i) \\
&\overset{(e)}{=} \sum_{i=1}^N H(Y_i | Y^{i-1}) - H(Y_i | Y^{i-1}, X^i) \\
&= I(X^N \to Y^N) \\
&\overset{(f)}{\le} NC.
\end{aligned} \tag{89}$$

Inequality (a) follows from the converse for rate distortion [24, Theorem 7.2.5]. Inequality (b) follows from the data processing
inequality because $U^N - Y^N - V^N$ form a Markov chain. Equality (c) follows from the chain rule. Equality (d)
follows from the fact that $X_i$ is a deterministic function of $(U^N, Y^{i-1})$. Equality (e) follows from the Markov chain $U^N - X_i(U^N, Y^{i-1}) - Y_i$. Finally, inequality (f) follows from the converse for the channel with feedback given in Theorem 15. ■

X. Conclusion and future work 

We determined an achievable rate and an upper bound on the capacity of FSCs with feedback that is a deterministic
function of the channel output. The achievable rate is obtained via a randomly generated coding scheme that utilizes
feedback, along with an ML decoder. In the case that the channel is an indecomposable FSC without ISI, the upper
bound and the achievable rate coincide and, therefore, they are the capacity of the channel. One future direction is
to generalize the class of channels for which the achievable rate equals the upper bound on the capacity and to use this
formula in order to compute the capacity in various settings involving side information and feedback.

By using the directed information formula for the capacity of FSCs with feedback developed in this work, it was
shown in [25] that the feedback capacity of a channel introduced by David Blackwell in 1961 [26], also known
as the trapdoor channel [27], is the logarithm of the golden ratio. The capacity of Blackwell's channel without
feedback is still unknown. Another direction for future work is to find the capacity of additional channels with time-invariant
feedback.
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Appendix I 
Proof of Lemma 5

$$\begin{aligned}
&\left|I(X^N \to Y^N \| Z^{N-1}) - I(X^N \to Y^N \| Z^{N-1}, S)\right| \\
&\overset{(a)}{=} \left|\sum_{i=1}^N I(Y_i; X^i | Y^{i-1}, Z^{i-1}) - I(Y_i; X^i | Y^{i-1}, Z^{i-1}, S)\right| \\
&= \left|\sum_{i=1}^N H(Y_i | Y^{i-1}, Z^{i-1}) - H(Y_i | Y^{i-1}, X^i, Z^{i-1}) - H(Y_i | Y^{i-1}, Z^{i-1}, S) + H(Y_i | Y^{i-1}, X^i, Z^{i-1}, S)\right| \\
&= \left|\sum_{i=1}^N H(Y_i | Y^{i-1}, Z^{i-1}) - H(Y_i | Y^{i-1}, Z^{i-1}, S) - H(Y_i | Y^{i-1}, X^i, Z^{i-1}) + H(Y_i | Y^{i-1}, X^i, Z^{i-1}, S)\right| \\
&= \left|\sum_{i=1}^N I(Y_i; S | Y^{i-1}, Z^{i-1}) - I(Y_i; S | Y^{i-1}, X^i, Z^{i-1})\right| \\
&\overset{(b)}{\le} \max\left(\sum_{i=1}^N I(Y_i; S | Y^{i-1}, Z^{i-1}),\ \sum_{i=1}^N I(Y_i; S | Y^{i-1}, X^i, Z^{i-1})\right) \\
&\overset{(c)}{\le} \max\left(\sum_{i=1}^N I(Y_i, Z_i; S | Y^{i-1}, Z^{i-1}),\ \sum_{i=1}^N I(Y_i, Z_i, X_i; S | Y^{i-1}, X^{i-1}, Z^{i-1})\right) \\
&\overset{(d)}{=} \max\left(I(Y^N, Z^N; S),\ I(Y^N, Z^N, X^N; S)\right) \\
&\overset{(e)}{\le} \max\left(H(S), H(S)\right) \\
&\le \log|\mathcal{S}|.
\end{aligned} \tag{90}$$

Equality (a) is due to the definition of the directed information. Inequality (b) holds because the magnitude of
the difference between two nonnegative numbers is at most the maximum of the numbers. Inequality (c) is due
to the fact that $I(X; Y) \le I(X, Z; Y)$ for any random variables $X, Y, Z$. Equality (d) is due to the chain rule of
mutual information. Inequality (e) is due to the fact that the mutual information of two variables is at most the
entropy of each variable, and the last inequality holds because the cardinality of the alphabet of $S$ is $|\mathcal{S}|$. ■
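The bound of Lemma 5 can be sanity-checked numerically in its simplest instance. The sketch below assumes a single channel use without feedback, in which case the statement reduces to $|I(X;Y) - I(X;Y|S)| \le H(S) \le \log|\mathcal{S}|$; the random joint distributions, the binary alphabets for $X$ and $Y$, and the function names are all choices made for this illustration.

```python
import math
import random
from itertools import product

def entropy(pmf):
    # Shannon entropy in bits of a pmf given as {outcome: probability}.
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axes):
    # Marginal of a joint pmf {(s, x, y): p} on the listed coordinate positions.
    out = {}
    for key, p in joint.items():
        sub = tuple(key[a] for a in axes)
        out[sub] = out.get(sub, 0.0) + p
    return out

def gap_and_state_entropy(joint):
    # Returns (|I(X;Y) - I(X;Y|S)|, H(S)) for a joint pmf over (s, x, y).
    def h(axes):
        return entropy(marginal(joint, axes))
    i_xy = h((1,)) + h((2,)) - h((1, 2))
    i_xy_given_s = h((0, 1)) + h((0, 2)) - h((0, 1, 2)) - h((0,))
    return abs(i_xy - i_xy_given_s), h((0,))

def random_joint(num_states, rng):
    # Random joint pmf over (state, binary input, binary output); illustrative only.
    weights = {k: rng.random() for k in product(range(num_states), range(2), range(2))}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

if __name__ == "__main__":
    rng = random.Random(0)
    for num_states in (2, 3, 5):
        gap, h_s = gap_and_state_entropy(random_joint(num_states, rng))
        print(num_states, gap, h_s, math.log2(num_states))
```

This only exercises the single-letter case; the lemma itself concerns sums over $N$ channel uses with causal conditioning on the feedback.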



Appendix II 
Proof of Theorem!!] 



$$\begin{aligned}
E(P_{e,m}) &= \sum_{y^N}\sum_{x^N} P(x^N, y^N) P[\text{error} | m, x^N, y^N] \\
&= \sum_{y^N}\sum_{x^N} Q(x^N\|z^{N-1}) P(y^N\|x^N) P[\text{error} | m, x^N, y^N],
\end{aligned} \tag{91}$$

where $P[\text{error} | m, x^N, y^N]$ is the probability of decoding error conditioned on the message $m$, the output $y^N$ and the
input $x^N$. The second equality is due to Lemma^. Throughout the remainder of the proof we fix the message $m$. For a
given tuple $(m, x^N, y^N)$ define the event $A_{m'}$, for each $m' \ne m$, as the event that the message $m'$ is selected in such
a way that $P(y^N | m') \ge P(y^N | m)$ which, according to eq. (32), is the same as $P(y^N\|x'^N) \ge P(y^N\|x^N)$, where
$x'^N$ is a shorthand notation for $x^N(m', z^{N-1}(y^{N-1}))$ and $x^N$ is a shorthand notation for $x^N(m, z^{N-1}(y^{N-1}))$.
From the definition of $A_{m'}$ we have

$$\begin{aligned}
P(A_{m'} | m, x^N, y^N) &= \sum_{x'^N} Q(x'^N\|z^{N-1}) \cdot I\left[P(y^N\|x'^N) \ge P(y^N\|x^N)\right] \\
&\le \sum_{x'^N} Q(x'^N\|z^{N-1}) \left[\frac{P(y^N\|x'^N)}{P(y^N\|x^N)}\right]^s, \quad \text{any } s > 0,
\end{aligned} \tag{92}$$

where $I(x)$ denotes the indicator function.

$$\begin{aligned}
P[\text{error} | m, x^N, y^N] &= P\Big(\bigcup_{m' \ne m} A_{m'} \,\Big|\, m, x^N, y^N\Big) \\
&\le \min\Big\{\sum_{m' \ne m} P(A_{m'} | m, x^N, y^N),\ 1\Big\} \\
&\le \Big[\sum_{m' \ne m} P(A_{m'} | m, x^N, y^N)\Big]^{\rho}, \quad \text{any } 0 \le \rho \le 1 \\
&\le \left[(M-1) \sum_{x'^N} Q(x'^N\|z^{N-1}) \left[\frac{P(y^N\|x'^N)}{P(y^N\|x^N)}\right]^s\right]^{\rho}, \quad 0 \le \rho \le 1,\ s > 0,
\end{aligned} \tag{93}$$

where the last inequality is due to inequality (92). By substituting inequality (93) in eq. (91) we get:

$$E[P_{e,m}] \le (M-1)^{\rho} \sum_{y^N} \left[\sum_{x^N} Q(x^N\|z^{N-1}) P(y^N\|x^N)^{1 - s\rho}\right] \left[\sum_{x'^N} Q(x'^N\|z^{N-1}) P(y^N\|x'^N)^{s}\right]^{\rho}. \tag{94}$$

By substituting $s = 1/(1 + \rho)$, and recognizing that $x'$ is a dummy variable of summation, we obtain eq. (34) and
complete the proof. ■



Appendix III 
Proof of Theorem|9] 



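Theorem 3 holds for any distribution of the initial state $s_0$. In particular, it holds for the case that $P(s_0) = \frac{1}{|\mathcal{S}|}$,
namely, the uniform distribution. By assuming a uniform distribution on the initial state, we get that the likelihood
function satisfies

$$\begin{aligned}
P(y^N\|x^N) &\overset{(a)}{=} P(y^N | m) \\
&= \sum_{s_0} P(y^N, s_0 | m) \\
&\overset{(b)}{=} \sum_{s_0} P(s_0) P(y^N | m, s_0) \\
&= \sum_{s_0} \frac{1}{|\mathcal{S}|} \prod_{i=1}^N P(y_i | y^{i-1}, m, s_0) \\
&= \sum_{s_0} \frac{1}{|\mathcal{S}|} \prod_{i=1}^N P(y_i | y^{i-1}, m, x^i, s_0) \\
&= \sum_{s_0} \frac{1}{|\mathcal{S}|} \prod_{i=1}^N P(y_i | y^{i-1}, x^i, s_0) \\
&= \sum_{s_0} \frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0).
\end{aligned} \tag{95}$$

Equality (a) is shown in eq. (32) and equality (b) holds due to the assumption that the initial state $S_0$ and the
message $m$ are independent. Thus, assuming that $S_0$ is uniformly distributed, the bound on error probability under
ML decoding given in Theorem 8 becomes

$$\sum_{s_0} \frac{1}{|\mathcal{S}|} E(P_{e,m}(s_0)) \le (M-1)^{\rho} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[\sum_{s_0} \frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}, \quad 0 \le \rho \le 1. \tag{96}$$

And therefore, for any initial state $s_0$,

$$E(P_{e,m}(s_0)) \le |\mathcal{S}| (M-1)^{\rho} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[\sum_{s_0} \frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}, \quad 0 \le \rho \le 1. \tag{97}$$

Since $m$ was arbitrary, we obtain a fortiori

$$E(P_{e}(s_0)) \le |\mathcal{S}| (M-1)^{\rho} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[\sum_{s_0} \frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}, \quad 0 \le \rho \le 1, \tag{98}$$

where $P_e(s_0)$ is the probability of error over all messages given that the initial state is $s_0$ and the expectation is
w.r.t. the random generation of the code. It is possible to construct a code for $2M$ messages such that this inequality
holds for the average and then to pick the best $M$ messages such that the bound holds for each message within a
factor of 4. I.e., we get that for every $1 \le m \le M$,

$$P_{e,m}(s_0) \le 4|\mathcal{S}| (M-1)^{\rho} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[\sum_{s_0} \frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}, \quad 0 \le \rho \le 1. \tag{99}$$

By using the inequality $(\sum_i a_i)^r \le \sum_i (a_i)^r$ for $0 < r \le 1$ we can move the sum over $s_0$, yielding

$$P_{e,m}(s_0) \le 4|\mathcal{S}| (M-1)^{\rho} \sum_{y^N} \left\{\sum_{s_0} \sum_{x^N} Q(x^N\|z^{N-1}) \left[\frac{1}{|\mathcal{S}|} P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}, \quad 0 \le \rho \le 1. \tag{100}$$

Furthermore, we can move the sum over $s_0$ once again by rearranging the sum and then using Jensen's inequality:

$$\begin{aligned}
P_{e,m}(s_0) &\overset{(a)}{\le} 4|\mathcal{S}| (M-1)^{\rho} \sum_{y^N} \left\{\sum_{s_0} \left(\frac{1}{|\mathcal{S}|}\right)^{\frac{1}{1+\rho}} \sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho} \\
&\overset{(b)}{=} 4|\mathcal{S}| (M-1)^{\rho} |\mathcal{S}|^{\rho} \sum_{y^N} \left\{\sum_{s_0} \frac{1}{|\mathcal{S}|} \sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho} \\
&\overset{(c)}{\le} 4|\mathcal{S}| (M-1)^{\rho} |\mathcal{S}|^{\rho} \sum_{y^N} \sum_{s_0} \frac{1}{|\mathcal{S}|} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho} \\
&= 4|\mathcal{S}| (M-1)^{\rho} |\mathcal{S}|^{\rho} \sum_{s_0} \frac{1}{|\mathcal{S}|} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho} \\
&\overset{(d)}{\le} 4|\mathcal{S}| (M-1)^{\rho} |\mathcal{S}|^{\rho} \max_{s_0} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}.
\end{aligned} \tag{101}$$

Steps (a) and (b) are achieved by moving the term $\frac{1}{|\mathcal{S}|}$ outside the sums. Inequality (c) is achieved
by applying Jensen's inequality $(\sum_i p_i a_i)^r \le \sum_i p_i (a_i)^r$ for $r \ge 1$. Inequality (d) holds because the number of elements
multiplied by the maximum element is larger than the sum of elements. Because the inequality holds for all
$Q(x^N\|z^{N-1})$,

$$P_{e,m}(s_0) \le 4|\mathcal{S}| (M-1)^{\rho} |\mathcal{S}|^{\rho} \min_{Q(x^N\|z^{N-1})} \max_{s_0} \sum_{y^N} \left\{\sum_{x^N} Q(x^N\|z^{N-1}) \left[P(y^N\|x^N, s_0)\right]^{\frac{1}{1+\rho}}\right\}^{1+\rho}. \tag{102}$$

By substituting $M = 2^{NR}$ and eq. (36) and (37) into (102), we prove the theorem. ■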
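The two elementary inequalities used in this appendix, $(\sum_i a_i)^r \le \sum_i a_i^r$ for $0 < r \le 1$ and Jensen's inequality $(\sum_i p_i a_i)^r \le \sum_i p_i a_i^r$ for $r \ge 1$, are easy to spot-check numerically. The following sketch is only a sanity check on random nonnegative inputs; the function names and the random test vectors are assumptions made for this illustration.

```python
import random

def power_sum_gap(values, r):
    # sum_i a_i^r - (sum_i a_i)^r, which is nonnegative when 0 < r <= 1.
    return sum(v ** r for v in values) - sum(values) ** r

def jensen_gap(weights, values, r):
    # sum_i p_i a_i^r - (sum_i p_i a_i)^r, which is nonnegative when r >= 1.
    mean = sum(p * v for p, v in zip(weights, values))
    return sum(p * v ** r for p, v in zip(weights, values)) - mean ** r

def random_pmf(size, rng):
    # Random probability vector of the given length.
    raw = [rng.random() for _ in range(size)]
    total = sum(raw)
    return [x / total for x in raw]

if __name__ == "__main__":
    rng = random.Random(0)
    rho = 0.5
    values = [rng.random() for _ in range(4)]
    print(power_sum_gap(values, 1.0 / (1.0 + rho)))
    print(jensen_gap(random_pmf(4, rng), values, 1.0 + rho))
```

With $r = 1/(1+\rho)$ the first gap corresponds to the step leading to (100), and with $r = 1 + \rho$ the second corresponds to step (c) of (101).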

Appendix IV 
Proof of Lemma[TT1 

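Let us divide the input $x^N$ into two sets $\mathbf{x}_1 = x_1^n$ and $\mathbf{x}_2 = x_{n+1}^N$. Similarly, let us divide the output
$y^N$ into two sets $\mathbf{y}_1 = y_1^n$ and $\mathbf{y}_2 = y_{n+1}^N$ and the feedback $z^N$ into $\mathbf{z}_1 = z_1^{n-1}$ and $\mathbf{z}_2 = z_{n+1}^{N-1}$. Let
$Q_n(\mathbf{x}_1\|\mathbf{z}_1) = \prod_{i=1}^n P(x_i | x^{i-1}, z^{i-1})$ and $Q_l(\mathbf{x}_2\|\mathbf{z}_2) = \prod_{i=1}^l P(x_{n+i} | x_{n+1}^{n+i-1}, z_{n+1}^{n+i-1})$ be the probability assignments
that achieve the maxima $F_n(\rho)$ and $F_l(\rho)$, respectively. Let us consider the probability assignment $Q(x^N\|z^{N-1}) = Q_n(\mathbf{x}_1\|\mathbf{z}_1) Q_l(\mathbf{x}_2\|\mathbf{z}_2)$. Then

$$F_N(\rho) \ge -\frac{\rho \log|\mathcal{S}|}{N} + E_{o,N}(\rho, Q(x^N\|z^{N-1}), s_0'), \tag{103}$$

where $s_0'$ is the state that minimizes $E_{o,N}(\rho, Q(x^N\|z^{N-1}), s_0')$.
Now,

$$\begin{aligned}
P(y^N\|x^N, s_0') &\overset{(a)}{=} P(y^N | m, s_0') \\
&= \sum_{s_n} P(y^N, s_n | m, s_0') \\
&= \sum_{s_n} P(\mathbf{y}_1, s_n | m, s_0') P(\mathbf{y}_2 | m, s_n, \mathbf{y}_1, s_0') \\
&= \sum_{s_n} P(\mathbf{y}_1, s_n | m, s_0') P(\mathbf{y}_2\|\mathbf{x}_2, s_n).
\end{aligned} \tag{104}$$

Equality (a) can be proved in the same way as eq. (32) was proved. The term $P(\mathbf{y}_1, s_n | m, s_0')$ can be also expressed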
in terms of yi,xi in the following way: 



hence we obtain: 
Consequently, 

2 l-NF N (p)] 
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(105) 
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(107) 



Inequality (a) is due to inequality (103). Equality (b) is due to eq. (106). Inequality (c) holds because of the
same reason as given in eq. (100), namely $(\sum_i a_i)^r \le \sum_i (a_i)^r$. Inequality (d) is due to Minkowski's inequality

$$\Big[\sum_j P_j \Big(\sum_k a_{jk}\Big)^{1/r}\Big]^r \ge \sum_k \Big[\sum_j P_j a_{jk}^{1/r}\Big]^r \quad \text{for } r \ge 1. \ ■$$



